INTRODUCTION


This  Chapter  presents  a  very  idiosyncratic  view  of 
multivariate  analysis  and  reflects  what  the  author  has 
found  useful  in  his  statistical  practice.  It  stresses 
exploration,  virtually  ignores  tests  of  significance,  and 
emphasizes  a  particular  graphical  technique  developed  by 
the  author  —  biplot  display.  The  author  hopes  that  the 
Chapter  will  help  its  readers  towards  a  better  grasp  of 
the  structure  of  multivariate  data  and  the  fundamentals  of 
multivariate  analysis.  He  hopes  that,  despite  the  Chapter's 
personal  bias,  it  will  also  help  readers  who  will  wish  to 
pursue  multivariate  analysis  in  its  more  classical  form,  for 


1.  ONE  BATCH  OF  MULTIVARIATE  DATA  AND  THEIR 
DESCRIPTIVE  STATISTICS 

1.1.  A  multivariate  data  matrix 

The essence of multivariability is that several variables
are observed on each unit. Thus, if the units are days, one
might observe maximum and minimum temperatures on each day,
as well as precipitation and surface barometric pressure at
6 a.m., 12 noon, 6 p.m. and midnight; these would be
7-variate observations. It is convenient to think of the
data as a matrix $Z_{(n \times m)}$ in which row $\underline{z}_i'$ ($i=1,\dots,n$)
contains the m-variate observations for unit i, out of
n units, each column $\underline{z}_{(v)}$ ($v=1,\dots,m$) contains all
n units' observations on the v-th variable, and $z_{i,v}$ is
unit i's observation on variable v. Thus, in this example,
element $z_{2,3}$ would be the total precipitation on day 2,
$\underline{z}_2'$ would be the seven-variate observations on day 2 and
$\underline{z}_{(3)}$ the n days' observations on the third variable, total
precipitation.

We  adopt  the  convention  of  denoting  a  matrix  by 
a  Latin  capital  letter  and  any  of  its  elements  by  the 
corresponding  lower  case  letter  with  two  indices,  which 
indicate,  respectively,  the  row  and  column  in  which  the 
element  is  located.  We  denote  both  rows  and  columns 
of  the  matrix  by  the  lower  case  letter  underlined  and  with 
a  single  index:  if  the  index  is  in  parentheses,  a  column 
is  denoted;  if  no  parentheses  are  shown,  a  row  is  indicated. 
For a detailed illustration, consider the data of
Table 1: mean monthly temperatures for 20 stations during
6 months of the year 1951. Here, n=20, m=6, and $z_{3,1}$
is the third station's mean January temperature, whereas
$z_{1,3}$ is the first station's mean May temperature. The
location of the 20 stations is shown on the map of Figure 1.

A first glance at a data matrix like this is apt
to  be  somewhat  confusing.  Some  idea  of  the  general 
pattern  can  be  obtained  from  the  means  and  standard 
deviations  —  shown  at  the  bottom  of  Table  1.  The 
temperature  averages  are  seen  to  be  much  the  same  in  all 
the  six  months  (evidently  because  the  stations  are  spread 
on  both  sides  of  the  equator) .  However,  there  is 
considerable  variation  from  station  to  station,  as  evidenced 
by  the  large  standard  deviations:  these  are  around  50 
(i.e.,  5  degrees  centigrade)  for  each  month,  though  a  bit 
less  for  March  and  May. 
TABLE 1

Mean  Monthly  Temperatures  at  Certain  American  Stations,  1951 

(10  x  centigrade) 
1.2. Summary statistics for the variables' configuration

The common statistics for the batch of n units are
readily obtained from matrix Z. Thus, the means
$\bar{z}_{(1)},\dots,\bar{z}_{(m)}$ of all m variables are arrayed in vector

$$\bar{\underline{z}}' = (1/n)\,\underline{1}_n' Z, \tag{1.1}$$

where $\underline{1}_n$ is a vector of n ones. Deviations from each
variable's mean are given in matrix

$$Y = Z - \underline{1}_n \bar{\underline{z}}', \tag{1.2}$$

which has typical element

$$y_{i,v} = z_{i,v} - \bar{z}_{(v)}. \tag{1.3}$$

The means for the temperature illustration were noted
in Table 1, and the deviations $y_{i,v}$ from the means are shown
in Table 2. Thus, $y_{1,3} = -7.9 = 224.0 - 231.9 = z_{1,3} - \bar{z}_{(3)}$.

From these one may compute the variance matrix (often
referred to as the variance-covariance matrix)

$$S = \frac{1}{n}\, Y'Y, \tag{1.4}$$

the standard deviations

$$s_v = \sqrt{s_{v,v}} \quad (v=1,\dots,m), \tag{1.5}$$

and, defining the diagonal matrix with elements $d_{v,v} = s_v$
as $D_s$, the correlation matrix

$$R = D_s^{-1} S D_s^{-1}, \tag{1.6}$$

whose elements are the correlations $r_{v,v'}$ between variables.

For the temperature illustration, the standard deviations
were shown in Table 1, and the variance and correlation
matrices are given in Tables 3 and 4, respectively.
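The chain of definitions from the means through the correlation matrix can be sketched numerically. The following is a minimal illustration, assuming NumPy and a made-up data matrix (the values of Table 1 are not reproduced here); the divisor n rather than n-1 is taken from the later relation between column lengths and standard deviations.

```python
import numpy as np

# Hypothetical data: 20 units (rows) by 6 variables (columns).
# These are NOT the Table 1 temperatures.
rng = np.random.default_rng(0)
Z = rng.normal(loc=250.0, scale=50.0, size=(20, 6))
n, m = Z.shape

ones = np.ones(n)
z_bar = ones @ Z / n               # (1.1): vector of the m variable means
Y = Z - np.outer(ones, z_bar)      # (1.2): deviations from the means
S = Y.T @ Y / n                    # (1.4): variance matrix, divisor n
s = np.sqrt(np.diag(S))            # (1.5): standard deviations
D_s_inv = np.diag(1.0 / s)
R = D_s_inv @ S @ D_s_inv          # correlation matrix built from S and D_s
```

Each column of Y sums to zero, and R has a unit diagonal, which gives a quick check on any implementation.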
TABLE 2

Deviations  from  Monthly  Means  -  Data  of  Table  1 
TABLE 4

Correlation  Matrix  of  Mean  Temperatures  - 
Data  of  Table  1 
The configuration of correlations in an (m x m) matrix
R may not be easy to grasp at first, especially if m is 10
or more. In the present illustration with m=6, one may begin
to study Table 4 by concentrating on the highest correlations.
One notices two distinct sheaves of months: November, January
and March are highly intercorrelated and so are, even more
strongly, May, July and September. The correlations between
months not belonging to the same sheaf (or season) are seen
to be much lower.

Such a perusal of correlations is not always easy,
especially if the variables do not group neatly into
highly inter-correlated sheaves. It is sometimes helpful
also to consider the inverse of the variance matrix, i.e.,

$$S^{-1} = \left(\tfrac{1}{n}\, Y'Y\right)^{-1}, \tag{1.7}$$

because its elements $s^{v,v'}$ have the following interpretation
in terms of the multiple regression coefficients of the v-th
variable on all other variables. Take the v-th row of $S^{-1}$,
divide each off-diagonal element by the diagonal element and
change sign; then

$$b_{v,v'} = -s^{v,v'}/s^{v,v} \tag{1.8}$$

is the coefficient of variable v' in the regression for
variable v. Furthermore, using diagonal terms from both $S$ and $S^{-1}$,

$$r_v^2 = 1 - (s_{v,v}\, s^{v,v})^{-1} \tag{1.9}$$

gives the (squared) multiple correlation of variable v on all m-1
other variables.
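The reading of the inverse variance matrix in terms of regressions can be checked against an ordinary least squares fit. This is a sketch on hypothetical data, assuming NumPy; the variable index `v` and the names `b_ls` and `r2_ls` are illustrative choices, not part of the original text.

```python
import numpy as np

# Hypothetical correlated data, 20 units by 6 variables.
rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 6)) @ rng.normal(size=(6, 6))
n, m = Z.shape
Y = Z - Z.mean(axis=0)
S = Y.T @ Y / n
S_inv = np.linalg.inv(S)

v = 0                                        # variable to be regressed
others = [k for k in range(m) if k != v]

# Row v of the inverse: off-diagonal over diagonal, sign changed.
b = -S_inv[v, others] / S_inv[v, v]
# Squared multiple correlation from the two diagonal terms.
r2 = 1.0 - 1.0 / (S[v, v] * S_inv[v, v])

# The same quantities from a direct least squares fit on the deviations.
b_ls, *_ = np.linalg.lstsq(Y[:, others], Y[:, v], rcond=None)
resid = Y[:, v] - Y[:, others] @ b_ls
r2_ls = 1.0 - (resid @ resid) / (Y[:, v] @ Y[:, v])
```

The two routes agree to rounding error, so a single matrix inversion yields all m regressions at once.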


For the temperature illustration, the inverse $S^{-1}$
and the multiple correlation and regression coefficients are
given in Tables 5 and 6, respectively. Each month's
temperature is seen to be pretty highly correlated with
the temperatures of all other months. And of course the
multiple correlations are all higher than the correlations
with individual variables, that is,

$$r_v \ge r_{v,v'} \quad (v' \ne v). \tag{1.10}$$

It is interesting to see the pattern of regression coefficients.
Each month's coefficients with adjacent months are positive
but with months about half a year away (2 or 3 variables
away in the circular order ...123456123...) the coefficients
are negative. This makes good sense in terms of consistent
seasonal patterns.
1.3. Distances in the units' scatter

The above statistics describe the configuration of the
variables (monthly temperatures) for the entire batch of units,
no attention being paid to the individual units (stations).

If one is interested in the individual units, their
similarities and differences, one needs a description of
the scatter of the units. For this purpose one would
calculate the units' metric

$$U_{(n \times n)} = Y S^{-1} Y', \tag{1.11}$$

which can be interpreted in terms of standardized distances as
follows: The diagonal elements of U

$$u_{i,i} = \underline{y}_i' S^{-1} \underline{y}_i = (\underline{z}_i - \bar{\underline{z}})' S^{-1} (\underline{z}_i - \bar{\underline{z}}), \tag{1.12}$$

are squares of standardized distances of units i from the
centroid, i.e., from the multivariate mean of the batch.
The tetrad differences

$$d_{i,e} = u_{i,i} + u_{e,e} - u_{i,e} - u_{e,i} = (\underline{z}_i - \underline{z}_e)' S^{-1} (\underline{z}_i - \underline{z}_e), \tag{1.13}$$

are squares of standardized distances between units i and
e. Such distances should be understood as measuring
statistical differences simultaneously on all m variables. They
are equal to zero if and only if the units compared have equal
observations on all variables, and they increase when the
differences in any one or more variables become larger.
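A short numerical sketch may help fix the two kinds of distance. It assumes NumPy and hypothetical data, and forms the between-unit squared distances from the four-term (tetrad) combination of elements of U, comparing one of them with the direct quadratic form in the difference of two rows.

```python
import numpy as np

# Hypothetical data, 20 units by 6 variables.
rng = np.random.default_rng(2)
Z = rng.normal(size=(20, 6))
n, m = Z.shape
Y = Z - Z.mean(axis=0)
S_inv = np.linalg.inv(Y.T @ Y / n)

U = Y @ S_inv @ Y.T                      # units' metric, n x n
u = np.diag(U)                           # squared distances to the centroid
D = u[:, None] + u[None, :] - U - U.T    # tetrad differences, all pairs
dist = np.sqrt(np.clip(D, 0.0, None))    # standardized distances

# Direct evaluation for one pair of units.
i, e = 1, 2
diff = Z[i] - Z[e]
d_direct = diff @ S_inv @ diff
```

With the divisor n in S, the trace of U equals n times m, which is a convenient check on the computation.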


Inter-station  Standardized  Distance  -  Data  of  Table 
The matrix of standardized distances $\sqrt{d_{i,e}}$ between each
pair of stations on the six months' temperature data is
given in Table 7. All distances are positive except the
"self-distances" in the diagonal which are identically
zero; the matrix is symmetric, that is, the i-to-e distance
$\sqrt{d_{i,e}}$ equals the e-to-i distance $\sqrt{d_{e,i}}$. Small distances,
such as $\sqrt{d_{8,9}} = 0.6$, indicate that stations 8 and 9
have very similar mean monthly temperatures; whereas
large distances, such as $\sqrt{d_{2,3}} = 5.3$, show that very
considerable differences in mean monthly temperatures exist
between stations 2 and 3. (The reader can verify this from
Table 1.)

It  is  difficult  to  inspect  a  table  of  distances  of 
this  magnitude  (not  to  speak  of  distance  matrices  for 
a  hundred  or  more  units) .  We  shall  therefore  require 
methods  of  disentangling  the  pattern  of  distances  of  a 
scatter  of  units  and  of  making  some  sense  of  such  a  distance 
matrix  —  these  will  be  discussed  below  in  Section  4. 


1.4.  Some  further  remarks  on  standardized  statistical  distances 


To understand the u and d statistics it is well to
begin by considering the standardized difference between units
i and e on one particular variable. Thus on the v-th
variable alone the distance would be

$$\sqrt{d_{i,e(v)}} = |y_{i,v} - y_{e,v}| / \sqrt{s_{v,v}}. \tag{1.14}$$

Similarly, for the linear combination of variables (LCV)
with coefficients $\underline{a} = (a_1,\dots,a_m)'$ the standardized
difference would be

$$\sqrt{d_{i,e(\underline{a})}} = |\underline{a}'\underline{y}_i - \underline{a}'\underline{y}_e| / \sqrt{\underline{a}' S \underline{a}}, \tag{1.15}$$

since the variance of that LCV is $\underline{a}'S\underline{a}$. A generalized
i-to-e distance, for all variables and LCVs together, can
then reasonably be defined as the maximum of all such LCVs'
differences. But it can be proved that this maximum
satisfies

$$\max_{a_1,\dots,a_m} \sqrt{d_{i,e(\underline{a})}} = \sqrt{d_{i,e}}, \tag{1.16}$$

so that the proposed generalized i-to-e distance of (1.13) can be
regarded as a maximum difference over all variables and
LCVs.

A similar explanation can be given for the structure
of the standardized distance $\sqrt{u_{i,i}}$ to the centroid.
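The maximum property can be illustrated numerically. The sketch below assumes NumPy and hypothetical data; the maximizing coefficient vector `a_star`, the inverse variance matrix applied to the difference of the two units, is the standard Cauchy-Schwarz argument and is supplied here as an assumption, since the text only states that the result can be proved.

```python
import numpy as np

# Hypothetical data, 20 units by 6 variables.
rng = np.random.default_rng(3)
Z = rng.normal(size=(20, 6))
n, m = Z.shape
Y = Z - Z.mean(axis=0)
S = Y.T @ Y / n
S_inv = np.linalg.inv(S)

i, e = 0, 1
delta = Y[i] - Y[e]
d_ie = delta @ S_inv @ delta           # squared generalized distance

def lcv_distance(a):
    """Standardized i-to-e difference on the LCV with coefficients a."""
    return abs(a @ delta) / np.sqrt(a @ S @ a)

a_star = S_inv @ delta                 # assumed maximizer (Cauchy-Schwarz)
best_random = max(lcv_distance(rng.normal(size=m)) for _ in range(2000))
single_variable = [lcv_distance(np.eye(m)[v]) for v in range(m)]
```

No random combination, and no single variable, exceeds the generalized distance, while `a_star` attains it exactly.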
1.5. What are the units and what are the variables?

Statistics  textbooks  usually  treat  only  the  description 
of  the  variables'  configurations  and  ignore  that  of  the 
units'  scatter.  This  is  presumably  because  statisticians 
have  mostly  been  concerned  with  random  samples  in  which  the 
individual  units  are  of  no  interest  in  themselves.  In 
practical  data  analysis,  however,  the  units  are  often  of 
real  interest  and  their  description  is  as  relevant  as  that 
of  the  variables.  Indeed,  it  is  not  always  obvious  which 
of  the  classifications  of  data  one  wants  to  regard  as  units 
and which as variables. In the example mentioned above,
time has appeared as a variable, but in a series of
successive observations on given stations or measurements it
might appear as a unit.

In  any  particular  application,  the  decision  of  what  to 
regard  as  units  and  what  as  variables  will  determine  what  will 
be  weighted  equally  and  what  will  be  standardized.  In  the 
analyses  discussed  above,  the  treatment  of  the  rows  and 
columns  of  data  matrix  Z  is  asymmetrical  -  columns  are 
correlated  with  equal  weight  attached  to  each  row  (station) ; 
rows  are  compared  by  distances  standardized  with  respect  to  the 
different  variables  (months) . 


2. THE GEOMETRY AND DISPLAY OF A BATCH OF
MULTIVARIATE DATA

2.1. The configuration of variables

We begin by considering the configuration of a batch,
in terms of its means, standard deviations and correlations.
We take it that the reader knows how to interpret each one
of these measures, but that he may be bewildered by the
magnitude of a correlation (or variance) matrix and may
need guidance to make any sense of the, say, $\binom{20}{2} = 190$
correlations from a 20-variate data batch. We will, therefore,
provide a method of representing such a configuration
and show an example of interpreting it. For brevity, we will
illustrate this on six-variate data.

Geometry is most useful in grasping the structure and
patterns of multivariate data. One may think of the m
variables as m vectors emanating from one center, the
centroid of the data, such that (i) the length of each
vector is proportional to the standard deviation of the
corresponding variable, and (ii) the cosine of the angle
between any two vectors is the correlation between the
corresponding two variables. In fact, it follows from (1.4)
and (1.5) that

$$\|\underline{y}_{(v)}\| = \sqrt{n}\, s_v \tag{2.1}$$

and from (1.6) that

$$\cos(\underline{y}_{(v)}, \underline{y}_{(v')}) = r_{v,v'}. \tag{2.2}$$
2.2. Approximation of the variables' configuration in the plane

Geometric conceptualization in hyper-space may not
be to everyone's taste, but an approximate representation
in the plane, or in 3D, is often quite useful in revealing
many of the features of a configuration. To illustrate,
consider the variances of Table 3 and the approximate
representation of their configuration by the arrows of Figure 2.
This display is called a biplot. The method of approximation
will be discussed later (subsection 2.3, below). Suffice
it to say now that the goodness of fit of this planar display
is 96.7% for the temperature illustration, so that little
of interest could have been lost by reducing this
configuration to the plane. (The dots on Figure 2 represent
the stations; more about that later, in subsection 2.6.)

(  (Figure  2  about  here)  ) 

The configuration of the arrows in the biplot of Figure
2 is particularly simple. The lengths of the arrows are
pretty similar, indicating similar variabilities of all months;
but the March and May arrows are the shortest, since the standard
deviations in these months are least (see Table 1). All
arrows are within the quadrant formed by those for January
and July. The angle separating the latter two arrows is
close to 90°: this indicates virtually zero correlation
between these two months (Table 4 shows this correlation to
be -.09, which is indeed negligible). In between these two
are all other months, with smaller angles, indicating positive
correlation. One may describe the entire configuration roughly
by two sheaves of arrows: a Fall-Winter sheaf of arrows
separated by small angles, i.e., highly correlated (with
March and November being particularly highly correlated),
and a Spring-Summer sheaf with slightly greater angles, i.e.,
less highly correlated. This description will be noted to
accord completely with that obtained from Table 4, above.

The  practical  usefulness  of  the  biplot  is  more  evident 
when  the  number  of  variables  is  larger.  In  that  case  it 
is  much  easier  to  see  patterns  and  sheaves  on  the  biplot 
than  by  inspection  of  the  matrix  of  correlations.  It  is 
a  matter  of  not  seeing  the  wood  (configuration)  for  the 
trees  (correlations)  because  there  are  so  many  of  the  latter. 

An  important  function  of  multivariate  data  analysis  is  to 
provide  such  simple  descriptive  tools  to  allow  the 
investigator  to  make  sense  out  of  the  mass  of  correlations 
and  other  data  spewed  out  by  modern  computers. 


2.3. Computation of planar approximations

The method of obtaining the planar approximation of
the variables' configuration is to solve

$$Y'Y \underline{q} = \lambda^2 \underline{q} \tag{2.3}$$

for the largest two eigenvalues $\lambda_1^2 \ge \lambda_2^2$ and their associated
eigenvectors $\underline{q}_1, \underline{q}_2$ (normalized to length one). One then
forms matrix

$$H_{(m \times 2)} = (\lambda_1 \underline{q}_1, \lambda_2 \underline{q}_2), \tag{2.4}$$

whose rows $\underline{h}_1',\dots,\underline{h}_m'$ are plotted as arrows emanating from
a common origin. This method is equivalent to least squares
fitting and its goodness of fit can be gauged by coefficient

$$\lambda_{[2]}^{(4)} = (\lambda_1^4 + \lambda_2^4)/\mathrm{tr}(Y'Y\,Y'Y) = 1 - \|Y'Y - HH'\|^2 / \|Y'Y\|^2. \tag{2.5}$$

It provides the approximations

$$\|\underline{h}_v\| \approx \sqrt{n}\, s_v \tag{2.6}$$

and

$$\cos(\underline{h}_v, \underline{h}_{v'}) \approx r_{v,v'}, \tag{2.7}$$

corresponding to (2.1) and (2.2), above.
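The computation of the arrows can be sketched as follows, assuming NumPy and hypothetical data with two dominant dimensions. The goodness of fit is evaluated in two ways, as a ratio of fourth powers and as the least squares criterion for fitting Y'Y by HH'; that these two agree is what the eigen-decomposition implies.

```python
import numpy as np

# Hypothetical data: two dominant dimensions plus a little noise.
rng = np.random.default_rng(4)
Z = (rng.normal(size=(20, 2)) @ rng.normal(size=(2, 6))) * 50 \
    + rng.normal(size=(20, 6)) * 5
n, m = Z.shape
Y = Z - Z.mean(axis=0)
s = np.sqrt((Y ** 2).mean(axis=0))        # standard deviations, divisor n

# Eigenvalues (lambda squared) and unit eigenvectors of Y'Y, largest first.
lam2, Q = np.linalg.eigh(Y.T @ Y)
lam2, Q = lam2[::-1], Q[:, ::-1]
lam = np.sqrt(lam2[:2])

H = Q[:, :2] * lam                        # column a is lambda_a * q_a

YtY = Y.T @ Y
fit = (lam2[0] ** 2 + lam2[1] ** 2) / np.trace(YtY @ YtY)
fit_direct = 1.0 - np.linalg.norm(YtY - H @ H.T) ** 2 / np.linalg.norm(YtY) ** 2

arrow_lengths = np.linalg.norm(H, axis=1) # compare with sqrt(n) * s
```

The arrow lengths can never exceed sqrt(n) times the standard deviations; they fall short by whatever the discarded dimensions contribute.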
2.4. Approximation of the units' scatter in the plane

The foregoing calculations are equivalent to those of
the first two principal components, a fact which will be
commented on later, and they also lead to a useful
representation of the units in terms of their statistical scatter.
One forms

$$F_{(m \times 2)} = (\lambda_1^{-1} \underline{q}_1, \lambda_2^{-1} \underline{q}_2) \tag{2.8}$$

and computes matrix

$$G_{(n \times 2)} = Y F, \tag{2.9}$$

whose rows $\underline{g}_1',\dots,\underline{g}_n'$ are plotted as points. The distances
between the plotted points then provide an approximate
representation of the standardized statistical distances
between the corresponding units, that is,

$$\|\underline{g}_i - \underline{g}_e\| \approx \sqrt{d_{i,e}}/\sqrt{n}, \tag{2.10}$$

as well as

$$\|\underline{g}_i\| \approx \sqrt{u_{i,i}}/\sqrt{n}. \tag{2.11}$$

The coefficients obtained by performing these calculations
on the temperature data are shown in Table 8.
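The unit coordinates can be sketched in the same way. The example below assumes NumPy and hypothetical data; it takes F to have columns equal to the eigenvectors divided by the corresponding lambda, which is consistent with the magnitudes listed in Table 8, and compares a planar distance with the exact standardized distance scaled by the square root of n.

```python
import numpy as np

# Hypothetical data: two dominant dimensions plus a little noise.
rng = np.random.default_rng(5)
Z = (rng.normal(size=(20, 2)) @ rng.normal(size=(2, 6))) * 50 \
    + rng.normal(size=(20, 6)) * 5
n, m = Z.shape
Y = Z - Z.mean(axis=0)

lam2, Q = np.linalg.eigh(Y.T @ Y)
lam2, Q = lam2[::-1], Q[:, ::-1]          # largest eigenvalue first
lam = np.sqrt(lam2[:2])

H = Q[:, :2] * lam                        # arrows for the variables
F = Q[:, :2] / lam                        # assumed form of F: q_a / lambda_a
G = Y @ F                                 # points for the units

# Exact standardized distance between two units, divided by sqrt(n).
S_inv = np.linalg.inv(Y.T @ Y / n)
i, e = 0, 1
delta = Y[i] - Y[e]
exact = np.sqrt(delta @ S_inv @ delta / n)
planar = np.linalg.norm(G[i] - G[e])      # its planar approximation
```

The columns of G come out orthonormal, and the product of G with the transpose of H is the rank 2 least squares fit to Y. The planar distance never exceeds the exact one, because only two of the m standardized dimensions are retained.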

The  interpretation  of  such  a  g-scatter  is  obvious. 
Distant  points  represent  units  which  are  statistically 
dissimilar;  points  close  together  represent  statistically 
similar units; clusters of points represent groups of
TABLE 8

Biplot (and Bimodel) Coordinates for Temperature Data

  Variable v   $\underline{h}_v'$ (rows of H)             $\underline{f}_v'$ (rows of F)
      1        198.05  -169.37   -29.23       .0010   -.0017   -.0056
      2        190.76   -62.32   -16.19       .0010   -.0006   -.0031
      3        162.02    85.17   -25.32       .0008    .0009   -.0049
      4        128.44   138.45   -13.20       .0007    .0019   -.0025
      5        163.86   135.40    17.81       .0009    .0014    .0034
      6        215.63   -68.44    54.51       .0011   -.0007    .0104

  Axis $\alpha$     $\lambda_\alpha$        $\lambda_\alpha^2$
      1        437.841     191704
      2        313.613      98353
      3         72.247       5220

  Unit i       $\underline{g}_i'$ (rows of G)
      1         .0206   -.1865   -.1004
      2         .1605   -.0533    .1526
      3        -.0167   -.4055   -.3292
      4        -.2971   -.2990    .3781
      5        -.3549   -.2212    .0300
      6         .1572   -.0923   -.1148
      7         .2279   -.0155   -.0235
      8         .1604   -.0025    .1362
      9         .1143   -.0130    .0810
     10         .1498   -.0227    .0190
     11        -.5509   -.1157    .1811
     12        -.1000   -.1692   -.5192
     13         .1459    .0179    .1099
     14         .2273   -.0004    .0124
     15         .2092    .0362   -.1138
     16         .1897    .1254    .0817
     17         .0294    .2422   -.0913
     18         .0504    .2036    .5040
     19        -.1752    .4054   -.2196
     20        -.3480    .5711   -.1744


Inter-station  Biplot  Distances 
statistically similar units; sets of points ordered across
the plot represent units differing in a systematic sequence,
etc.

Each station i is represented by its appropriate
vector $\underline{g}_i$ on the biplot of Figure 2. Note that different
scales can be used for the g's and the h's, though for
each of them its horizontal and vertical scales must be
the same.

Figure  2  shows  that  the  scatter  of  g  points  mimics  to  some 
extent  the  geographical  spread  of  the  stations  —  see  map  in 
Figure  1.  Thus,  the  Northernmost  stations  appear  on  top  of 
the  biplot  with  a  clear  diagonal  trend  associated  with 
latitude.  Stations  on  or  near  the  Northern  coast  of  South 
America  form  a  tight  cluster  (high  degree  of  statistical 
similarity)  whilst  stations  farther  south  and  west  in  South 
America  trail  out  towards  the  lower  left  of  the  biplot.  If 
we  split  the  stations  into  four  geographically  contiguous 
groups,  we  should  expect  greater  homogeneity  of  temperature 
profiles  within  each  group  and  considerable  inter-group 
differences.  Table  10  groups  the  inter-station  distances  of 
Table 7 accordingly and this confirms that the biplot clustering
does produce relatively homogeneous groups.

Clearly,  geographical  proximity  is  associated  with 
similarity  in  annual  temperature  profiles.  But  this 
association is not perfect, as witnessed by the fact that the
west  coast  stations  11  and  12  are  more  similar  to  southern 
stations  1,3, 4, 5  than  to  stations  9  and  15  which  are  closer 
by. 
TABLE 10

Median Distances Within and Between Groups

  From               To Group
  Group        I      II     III     IV      Stations
    I         3.3    3.6    4.1    4.3      1,3,4,5,11,12
    II        3.6    2.4    3.0    3.8      2,6,7,8,9,10,13,14,15
    III       4.1    3.0    2.0    1.9      16,17,18
    IV        4.3    3.8    1.9    2.8      19,20
2.5. Plotting of extra data points

The biplot has been constructed to display a data matrix
Z about its centroid $\bar{\underline{z}}'$, that is, it represents $\bar{\underline{z}}'$ at its
origin and displays deviations Y. At times one may wish to
display further data points that were not fitted in
obtaining the biplot. Thus, one may have an additional unit with
m-variate observations $\underline{z}_0'$. This will be centered as deviation
$\underline{y}_0' = \underline{z}_0' - \bar{\underline{z}}'$ and its biplot coordinates calculated as

$$\underline{g}_0' = \underline{y}_0' F \tag{2.12a}$$

or

$$\underline{g}_0' = (\underline{z}_0 - \bar{\underline{z}})' F. \tag{2.12b}$$

To illustrate, one might consider a hypothetical
station whose temperatures for each month were exactly
one standard deviation above the mean, i.e.,

$\underline{z}_0' = (284.68, 274.58, 273.46, 281.42, 287.14, 280.65)$.

Calculation of (2.12) yields biplot coordinates $\underline{g}_0' =$
(.2748, .0379). Such a point would appear in the biplot
(Figure 2) slightly to the right of $\underline{g}_{15}$. Indeed, the
temperatures for station 15 are similar to but slightly
smaller than those of this hypothetical station.
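Placing an extra unit on an existing biplot amounts to centering it on the fitted means and multiplying by F. The sketch below assumes NumPy and hypothetical data (the means and standard deviations of Table 1 are not available here); `biplot_point` is an illustrative helper name, and F is assumed to have columns equal to the eigenvectors divided by lambda.

```python
import numpy as np

# Hypothetical data standing in for the fitted batch.
rng = np.random.default_rng(6)
Z = (rng.normal(size=(20, 2)) @ rng.normal(size=(2, 6))) * 50 + 250 \
    + rng.normal(size=(20, 6)) * 5
n, m = Z.shape
z_bar = Z.mean(axis=0)
Y = Z - z_bar
s = np.sqrt((Y ** 2).mean(axis=0))        # standard deviations, divisor n

lam2, Q = np.linalg.eigh(Y.T @ Y)
lam2, Q = lam2[::-1], Q[:, ::-1]          # largest eigenvalue first
F = Q[:, :2] / np.sqrt(lam2[:2])
G = Y @ F

def biplot_point(z_new):
    """Planar coordinates of a unit that was not used in the fit."""
    return (np.asarray(z_new, dtype=float) - z_bar) @ F

# A hypothetical unit one standard deviation above the mean on every variable.
z0 = z_bar + s
g0 = biplot_point(z0)
```

Applying the same function to a row of the fitted data returns that row's own coordinates, and the centroid itself maps to the origin.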


The biplot (Gabriel, 1971) displays both the configuration
of the months (variables, columns of data matrix Z)
and the scatter of stations (units, rows of Z). Because
it displays them jointly it is called a biplot, and this
simultaneous representation allows more insight into the
data than could be obtained from the separate inspections
of variables (subsection 2.3) and of rows (subsection 2.4)
which have been illustrated above.

The biplot displays the actual deviations $y_{i,v} = z_{i,v} - \bar{z}_{(v)}$
by inner products

$$y_{i,v} \approx \underline{g}_i' \underline{h}_v, \tag{2.13}$$

with goodness of fit

$$\lambda_{[2]}^{(2)} = (\lambda_1^2 + \lambda_2^2)/\mathrm{tr}(Y'Y) = 1 - \|Y - GH'\|^2/\|Y\|^2. \tag{2.14}$$

In other words, the deviation for station i on variable v
can be visualized as the length of $\underline{h}_v$ times the length of
the projection of $\underline{g}_i$ (considered as a vector from the
origin) onto $\underline{h}_v$, the sign of the lengths' product being
positive or negative according to whether $\underline{g}_i$'s projection
onto $\underline{h}_v$ is in the same or opposite direction to $\underline{h}_v$ itself.
Clearly, then, a station i whose $\underline{g}_i$ is far out in the
direction of (opposite to) the vector $\underline{h}_v$ of a variable
v has large positive (negative) deviation $y_{i,v}$. When the
$\underline{g}_i$'s are less far out in the $\underline{h}_v$ direction (or opposite
it) the deviations are smaller.
Figure 3 displays the January arrow $\underline{h}_1$ and the station
1 and station 20 points $\underline{g}_1$ and $\underline{g}_{20}$ of the biplot of Figure 2.
It also shows the orthogonal projection of these two points
onto the line through $\underline{h}_1$. The projection of $\underline{g}_1$ is seen to be
of length 0.132 in the direction of $\underline{h}_1$, whose length is
256. Hence the biplot approximation of $y_{1,1} = 34.4$ is
$\underline{g}_1'\underline{h}_1 = .132 \times 256 = 33.8$. Similarly, the projection of
$\underline{g}_{20}$ onto the line through $\underline{h}_1$ is of length 0.625 in the direction
opposite $\underline{h}_1$. Hence the biplot approximation of $y_{20,1} = -157.6$
is $\underline{g}_{20}'\underline{h}_1 = -.625 \times 256 = -160$ (the minus sign being attached
because the projection is opposite the vector projected
upon).
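The same approximations can be computed directly from the planar coordinates listed in Table 8 rather than read off the figure. This is a sketch assuming NumPy; the identification of the first variable with January and of the listed rows with stations 1 and 20 follows the text, and the results differ slightly from the figures quoted above, which appear to have been measured graphically.

```python
import numpy as np

# First two coordinates from Table 8.
h1 = np.array([198.05, -169.37])     # January arrow
g1 = np.array([0.0206, -0.1865])     # station 1
g20 = np.array([-0.3480, 0.5711])    # station 20

def signed_projection(g, h):
    """Signed length of the projection of point g onto the direction of h."""
    return g @ h / np.linalg.norm(h)

len_h1 = np.linalg.norm(h1)
approx_y_1 = g1 @ h1                 # to be compared with the actual 34.4
approx_y_20 = g20 @ h1               # to be compared with the actual -157.6
proj_1 = signed_projection(g1, h1)
proj_20 = signed_projection(g20, h1)
```

The inner product and the product of projection length by arrow length are the same number, which is why the graphical construction works.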

This relation between $\underline{g}_i$ points and $\underline{h}_v$ arrows is useful
in interpreting the scatter of g points. Thus, one
may identify the variables (months) on which a cluster of
units (stations) is particularly large or small.

As an example, we note the northernmost stations in Figure 2
to be aligned in a direction opposite the Fall-Winter
sheaf. Evidently, the farther north the station, the lower
its Fall-Winter temperatures. (This is readily confirmed
by inspecting the last four or five rows of Table 2.) On
the other hand, the difference between the second and third
clusters of stations is associated with the direction of
the Spring-Summer temperatures: the north coast stations
have higher Spring-Summer temperatures than the western and
southern South American stations. (Again, Table 2 confirms
this pattern.)
At this stage it is well to realize that one can
inspect the biplot also for linear combinations of variables
beyond the actual variables of the data displayed. One
may do this by vector addition of h arrows. Thus, for
example, a March plus May sum would be represented by the
vector $\underline{h}_2 + \underline{h}_3$, which is readily constructed on the biplot (the
dashed line on Figure 4) and found to be roughly horizontal.
Also, a Spring-Summer sum $\underline{h}_3 + \underline{h}_4 + \underline{h}_5$ (the dashed and
dotted line on Figure 4) slants up at roughly 45° whereas
a Spring-Summer versus Fall-Winter difference
$(\underline{h}_3 + \underline{h}_4 + \underline{h}_5) - (\underline{h}_1 + \underline{h}_2 + \underline{h}_6)$ (the dashed and double
dotted line on Figure 4) is pretty close to vertical.
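These vector sums can be reproduced from the planar h coordinates of Table 8. The sketch assumes NumPy and assumes the variables are ordered January, March, May, July, September, November, which is inferred from the text rather than stated in the table; `combine` and `angle_deg` are illustrative helper names.

```python
import numpy as np

# Planar h coordinates from Table 8; assumed order Jan, Mar, May, Jul, Sep, Nov.
H = np.array([[198.05, -169.37],
              [190.76,  -62.32],
              [162.02,   85.17],
              [128.44,  138.45],
              [163.86,  135.40],
              [215.63,  -68.44]])

def combine(coeffs):
    """Arrow for the linear combination of variables with these coefficients."""
    return np.asarray(coeffs, dtype=float) @ H

def angle_deg(vec):
    """Angle of a planar vector above the horizontal axis, in degrees."""
    return np.degrees(np.arctan2(vec[1], vec[0]))

march_plus_may = combine([0, 1, 1, 0, 0, 0])
spring_summer = combine([0, 0, 1, 1, 1, 0])
contrast = combine([-1, -1, 1, 1, 1, -1])   # Spring-Summer minus Fall-Winter
```

Evaluating `angle_deg` on the three arrows gives a nearly horizontal sum, an upward slant of somewhat under 45 degrees, and a contrast within about 15 degrees of vertical, in line with the description of Figure 4.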

The  importance  of  such  combinations  of  variables  is 
great.  For  example,  we  note  the  northern  hemisphere 
station  points  to  be  mostly  above  the  biplot  origin  and 
the  southern  hemisphere  station  points  to  be  below.  This 
vertical  difference  evidently  is  one  of  Spring-Summer 
versus  Fall-Winter  temperatures  —  the  very  well  known 
fact that maximum temperatures in the Northern (Southern)
hemisphere are in the Spring-Summer (Fall-Winter). Similarly,
the  north  coastal  South  American  station  points  are  farthest 
to  the  right  of  the  biplot,  indicating  that  average 
temperatures  are  highest  in  that  region  --  again  a  well-known 
fact. 

These  features  of  the  biplot  are  of  considerable 
importance  for  data  analysis.  They  allow  one  to  go  beyond 


separate  descriptions  of  variables  and  of  units  and  actually 
account  for  units'  clusters  and  patterns  in  terms  of  the 
variables  that  determine  them. 


Finally, it will be noted that the signed length
of the projection of any unit's $\underline{g}_i$ vector onto any
variable's $\underline{h}_v$ vector direction approximates $1/\sqrt{n}$ times the
standardized (mean zero, variance one) observation on that
variable, i.e.,

$$\underline{g}_i'\underline{h}_v / \|\underline{h}_v\| \approx (z_{i,v} - \bar{z}_{(v)})/(\sqrt{n}\, s_v); \tag{2.15}$$

this is evident from approximations (2.6) and (2.13) of
$s_v$ and $y_{i,v}$, respectively. Clearly, the same holds for
any linear combination of variables when projections are made onto
the appropriate vector combination of h's.

To illustrate, the centroid to $\underline{g}_{20}$ vector has been
projected onto $\underline{h}_{(3+4+5-1-2-6)}$ in Figure 4. The length of
the projection is .62 whereas the length of the vector
projected upon is 690. Thus the biplot approximation of
the standardized Spring-Summer versus Fall-Winter
difference for station 20 is $\sqrt{20} \times .62 = 2.77$ (this is
an extreme observation, as is evident from the biplot), and
the actual difference 463.8 (as measured from the centroid)
is approximated on the biplot by .62 x 690 = 428.
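The station 20 illustration can be recomputed from the planar coordinates of Table 8. This sketch assumes NumPy and the variable order January, March, May, July, September, November (an inference from the text); the computed values come out close to, but not identical with, the graphically measured .62, 690, 2.77 and 428.

```python
import numpy as np

# Planar h coordinates from Table 8; assumed order Jan, Mar, May, Jul, Sep, Nov.
H = np.array([[198.05, -169.37],
              [190.76,  -62.32],
              [162.02,   85.17],
              [128.44,  138.45],
              [163.86,  135.40],
              [215.63,  -68.44]])
g20 = np.array([-0.3480, 0.5711])    # station 20, from Table 8
n = 20

# Spring-Summer (variables 3, 4, 5) minus Fall-Winter (variables 1, 2, 6).
weights = np.array([-1, -1, 1, 1, 1, -1], dtype=float)
h_contrast = weights @ H

length = np.linalg.norm(h_contrast)
projection = g20 @ h_contrast / length        # signed projection length
approx_difference = projection * length       # inner product approximation
approx_standardized = np.sqrt(n) * projection # standardized difference
```

The standardized value near 2.8 marks station 20 as an extreme unit on this contrast, as the text observes.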
2.7. Joint approximation in three dimensions - the bimodel

The biplot displays the rank 2 least squares
approximation of Y by GH'. One could similarly obtain a rank 3
approximation by solving (2.3) and (2.8) also for $\alpha=3$
and adding a further column to H, to F and to G; the
resulting bimodel could be constructed in three-space since
each $\underline{g}_i$ and $\underline{h}_v$ now has three coordinates. Higher dimensional
approximations can also be calculated by solving (2.3) and
(2.8) for further $\alpha$'s, but these cannot be constructed
physically.
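Approximations of any rank can be obtained in one step. The sketch below assumes NumPy and, as a substitution for solving the eigen-equation axis by axis, uses the singular value decomposition of Y, which yields the same G and H up to the sign of each axis; `bimodel` is an illustrative name and the data are hypothetical.

```python
import numpy as np

def bimodel(Y, rank):
    """Rank-`rank` least squares factors G, H of Y and the goodness of fit.

    Uses the SVD of Y: columns of G are left singular vectors, columns of
    H are right singular vectors scaled by the singular values (lambda).
    """
    P, lam, Qt = np.linalg.svd(Y, full_matrices=False)
    G = P[:, :rank]
    H = Qt[:rank].T * lam[:rank]
    fit = np.sum(lam[:rank] ** 2) / np.sum(lam ** 2)
    return G, H, fit

# Hypothetical centered data, 20 units by 6 variables.
rng = np.random.default_rng(7)
Z = (rng.normal(size=(20, 3)) @ rng.normal(size=(3, 6))) * 50 \
    + rng.normal(size=(20, 6)) * 5
Y = Z - Z.mean(axis=0)

G2, H2, fit2 = bimodel(Y, 2)    # the biplot
G3, H3, fit3 = bimodel(Y, 3)    # the three-dimensional bimodel
```

Going from rank 2 to rank 3 leaves the first two columns of G and H unchanged and can only raise the goodness of fit; at full rank the product of G with the transpose of H reproduces Y exactly.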

It  is,  however,  feasible  to  inspect  the  three  or 
higher  dimensional  approximations  by  displaying  various 
projections  on  a  CRT.  Facilities  exist  on  some  computer 
installations  that  allow  rotation  of  the  higher  dimensional 
approximation  so  one  gets  successive  two  dimensional  views 
from  different  angles.  This  may  be  quite  useful  in 
revealing  features  of  data  that  are  not  apparent  from 
the  original  planar  approximation . 

As an illustration, consider again the h configuration
in Figure 2. The annual cycle is represented by an upward
movement from $\underline{h}_1$ through $\underline{h}_2$ and $\underline{h}_3$ to $\underline{h}_4$ and then a pretty
similar downward movement from $\underline{h}_4$ through $\underline{h}_5$ and $\underline{h}_6$ to
$\underline{h}_1$. This suggested that there might be something like
an elliptical orbit of the h's in three-space. Indeed, if
one replots the h arrows along the second principal
axis and an axis at 45° to the third and fourth principal
axes (found by trial and error) one does find such an orbit
(Figure 5). True, the extra axis displayed in Figure 5 accounts
for very little of the data's variability, but there is
something satisfying in having a model which displays an
annual cycle rather than a mere two season clustering.

Whether  or  not  such  a  model  is  appropriate  and  worthwhile 
for  the  data  used  here,  it  illustrates  the  possibilities 
of  using  the  higher  dimensional  bimodels  for  further  inspection 
of  data. 
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3.  DATA  ANALYSIS  OF  THE  VARIABLES'  CONFIGURATION 

3.1.  Purposes  of  data  analysis 

Data  analysis  aims  at  systematizing  and  summarizing 
data  by  noting  regularities,  tracing  patterns,  fitting 
models,  etc.  When  the  configuration  of  a  set  of  variables 
is  described  by  its  variance  matrix,  a  data  analysis  will 
attempt  to  elicit  the  salient  features  of  variability  and 
inter-correlation  of  the  variables.  Display  of  h-vectors 
in  a  biplot,  or  in  a  higher  dimensional  bimodel,  allows 
visual  inspection  of  the  configuration  and  may  suggest 
grouping  of  highly  correlated  subsets  of  variables  whose 
h-vectors  form  tight  sheaves.  It  may  also  indicate  regular 
patterns such as the elliptical orbit associated with the
annual cycle of temperatures illustrated above (Section 2.7).
Such  indications  of  regularity,  whether  suggested  by  visual 
inspection  or  otherwise,  may  lead  to  formulation  of  a 
"model"  or  systematic  description  of  the  set  of  variables. 




3.2.  Variables'  sheaves  and  clustering  algorithms 


The  most  common  concept  used  for  such  descriptions  is 
that  of  a  "typical  variable."  When  the  variables  group 
naturally  into  subsets  such  that  there  is  high  correlation 
between  variables  within  subsets  and  much  lower  correlation 
from  subset  to  subset,  then  one  naturally  thinks  of  a 
"typical"  variable  for  each  subset.  Geometrically,  when  the 
h-vectors  separate  into  several  tight  sheaves,  one  may  well 
describe  each  sheaf  by  a  typical,  or  average,  h-arrow  going 
through  the  center  of  the  sheaf.  Thus,  in  the  temperature 
example  in  Section  2.2  above,  the  months'  configuration 
seemed  to  cluster  into  a  Fall-Winter  sheaf  and  a  Spring- 
Summer  sheaf,  and  one  could  think  of  a  typical  variable  for 
each . 

Where  there  are  too  many  variables  for  easy  direct  or 
graphical  inspection,  one  may  try  to  check  for  sheaves  by 
some  method  of  cluster  analysis.  If  the  variables  do  form 
separate  tight  sheaves,  this  will  be  revealed  by  any  clustering 
algorithm. However, in many cases there are no tight and
well separated sheaves and application of clustering
techniques does not yield meaningful results.

Unfortunately, clustering algorithms - of which there
are many, and quite a few are available within standard
statistical computer packages - always produce some output.
When no clear sheaves exist in the configuration, different
algorithms will yield different clusterings, none of which
are particularly meaningful. In such cases one would do better
not to use algorithms to force "clustering"; they should be
used only to check if clustering actually exists and then
reveal the existing clusters. A good practical rule may be
to use a number of alternative "clustering algorithms": any
"clusters" that are not revealed by most of the algorithms must
be considered suspect - they are likely to be artifacts.
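
The practical rule just stated can be sketched as follows. This is only one possible reading of it: "most of the algorithms" is taken here to mean a strict majority, and a cluster counts as "revealed" by an algorithm only if exactly the same set of variables appears in its output. The function name and the three sample partitions are hypothetical.

```python
from collections import Counter

def stable_clusters(partitions):
    """Return the clusters (as frozensets) found by more than half of the partitions."""
    counts = Counter()
    for partition in partitions:
        # Count each distinct cluster once per algorithm.
        for cluster in {frozenset(c) for c in partition}:
            counts[cluster] += 1
    majority = len(partitions) / 2
    return {cluster for cluster, k in counts.items() if k > majority}

# Hypothetical outputs of three clustering algorithms applied to six variables.
partitions = [
    [{"JAN", "MAR", "NOV"}, {"MAY", "JUL", "SEP"}],
    [{"JAN", "MAR", "NOV"}, {"MAY", "JUL"}, {"SEP"}],
    [{"JAN", "MAR", "NOV"}, {"MAY", "JUL", "SEP"}],
]
stable = stable_clusters(partitions)
print(sorted(sorted(c) for c in stable))
```

Exact agreement is a crude criterion; a looser measure of overlap between clusters could be substituted without changing the idea that clusters found by only one algorithm deserve suspicion.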
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3.3.  Principal  components 

When variables do not readily group into distinct
sheaves one may define "typical" variables, or LCV's, in
a different sense, one more akin to averaging. Thus, the
"most typical" LCV is often taken to be that which has the
highest average correlations, or squared correlations,
with the observed variables. This is obviously an attractive
descriptive property - such a "most typical" LCV is by
definition highly correlated with the variables it typifies.

If one wishes to describe the variables' configuration
by more than one "typical" LCV one may consider the residuals
from regressing each variable on the first "most typical"
LCV. Again, one may seek the LCV most highly correlated
with the residual parts of the variables - this will be the
"second most typical" LCV and will be found to be uncorrelated
with the first.

One  may  continue  in  this  way,  again  taking  residuals 
and  obtaining  a  "third  most  typical  LCV,"  etc. 

The  logic  of  looking  at  these  successive  residuals  is 
not  straightforward.  Only  the  first  of  the  "typical"  LCV's 
is  directly  related  to  the  original  variables.  For  all  the 
others  it  is  not  at  all  obvious  if  they  can  be  considered 
"typical"  of  the  original  variables. 

When the criterion of "most typical" is that of maximum
average squared covariance for any normalized LCV (i.e., an
LCV $y_{(c)} = \sum_v c_v y_{(v)}$ normalized so that $\sum_v c_v^2 = 1$), the
resulting "typical variables" are referred to as principal
components. Thus, the first principal component (PC1 for
short) is the vector

$$Y c = \sum_v y_{(v)} c_v \qquad (3.1)$$

which satisfies

$$\max_c \Big\{ \sum_v \big( y_{(v)}' Y c \big)^2 : c'c = 1 \Big\} = \max_c \big\{ \| Y'Y c \|^2 : c'c = 1 \big\}. \qquad (3.2)$$

But this is satisfied by solution $c = q_1$ of equations (2.3).
Hence PC1 is given as

$$Y q_1 = \lambda_1 p_1, \qquad (3.3)$$

which is the first column of G (2.8).

The next solution of (2.3), i.e., $q_2$, similarly
yields PC2 as

$$Y q_2 = \lambda_2 p_2, \qquad (3.4)$$

the second column of G.

These two PC's are uncorrelated for

$$\lambda_1 p_1' p_2 \lambda_2 = q_1' Y'Y q_2 = \lambda_2^2 \, q_1' q_2. \qquad (3.5)$$

But these eigenvectors are known to be orthogonal unless
$\lambda_1 = \lambda_2$.

In general, PC$\alpha$ is the LCV with observations vector
obtained by the solution of (2.3) with the $\alpha$-th largest
root $\lambda_\alpha^2$.
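
A minimal numerical sketch of this definition follows. It assumes that equations (2.3), which are not reproduced in this section, are the eigen-equations $Y'Y q = \lambda^2 q$, and it uses simple power iteration, which is a convenient stand-in and not necessarily the computing method the author used. The data matrix is made up and already centered.

```python
import math

def first_pc(Y, iterations=200):
    """Power iteration for the leading eigenvector of Y'Y (Y is column-centered)."""
    n, m = len(Y), len(Y[0])
    # S = Y'Y, the m x m matrix of sums of squares and products.
    S = [[sum(Y[i][a] * Y[i][b] for i in range(n)) for b in range(m)] for a in range(m)]
    c = [1.0 / math.sqrt(m)] * m              # starting vector with c'c = 1
    for _ in range(iterations):
        w = [sum(S[a][b] * c[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        c = [x / norm for x in w]             # renormalize so that c'c = 1
    # Largest root lambda^2 = c' S c at convergence.
    lam2 = sum(c[a] * sum(S[a][b] * c[b] for b in range(m)) for a in range(m))
    scores = [sum(Y[i][v] * c[v] for v in range(m)) for i in range(n)]   # Y c
    return c, lam2, scores

# Small made-up centered data matrix: 4 units, 2 variables.
Y = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
q1, lam2, pc1 = first_pc(Y)
variance = sum(s * s for s in pc1) / len(Y)
print(q1, lam2, variance)
```

For this matrix $Y'Y$ is diagonal with entries 8 and 2, so the iteration settles on the first variable and the variance of the resulting LCV is the largest root divided by n. Power iteration converges slowly when the two largest roots are nearly equal, which is the same case in which the eigenvectors themselves are poorly determined.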

Another property of the PC's is that PC1 has the largest
variance of all normalized LCV's, i.e.,

$$\| Y q_1 \|^2 / n = \lambda_1 p_1' p_1 \lambda_1 / n = \lambda_1^2 / n. \qquad (3.6)$$

Similarly, amongst all LCV's uncorrelated with PC1, it is
PC2 which has the largest variance

$$\| Y q_2 \|^2 / n = \lambda_2 p_2' p_2 \lambda_2 / n = \lambda_2^2 / n, \qquad (3.7)$$

and so on for other PC's. Another property of PC's is that the
first two provide the principal axes of the biplot - PC1
is along the horizontal axis, PC2 along the vertical -
and the remaining PC's are along the other principal axes
of the bimodel and higher order approximations. This last
property is due to the fact that PC's have simple least
squares properties (which were discovered by Householder and
Young in 1938). In particular, the plane that best
approximates the configuration is that going through the first two
principal axes. In other words, the best fitting two
dimensional approximation of Y is

$$Y_{[2]} = \lambda_1 p_1 q_1' + \lambda_2 p_2 q_2', \qquad (3.8)$$

which is a function of PC1 and PC2 only. It is because of
this least squares property that these two principal axes
were chosen to serve as horizontal and vertical axes of
the biplot.

The  generalization  of  these  remarks  to  a  3-D  or  higher 
order  approximations  is  obvious. 

The relation of the PC's to coordinates of the biplot
is simply that the i-th unit's observation on PC$\alpha$ is

$$y_i' q_\alpha = \lambda_\alpha g_{i,\alpha}. \qquad (3.9)$$

For the first two PC's this follows from (2.9) and (3.3), (3.4).
Table 11 gives the six principal coordinates for the twenty
stations of the temperature data of Table 1. It is not obvious
what interpretation one might want to put on these coordinates
as such, though plotting the first three, scaled by $\lambda_1$,
$\lambda_2$ and $\lambda_3$, respectively, was found useful in inspection of the
bimodel.
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Table  11 


None of these mathematical properties make it clear
why the PCs should be particularly interesting for an
understanding of the variables' configuration. Clearly, PC1
makes intuitive sense as an "average" or typical variable,
or as the LCV with maximal variability. But what of PC2,
PC3, etc.? They do not seem to have clear intuitive
descriptive appeal. Their least squares property makes them
useful for building approximations in the plane (the biplot),
in 3-D (the bimodel), etc., but that does not make them
interesting individually. In the temperature example, the
interesting "typical" LCVs seemed to be going at 45° and
-45° rather than at 0° and 90° to the horizontal. The
fact that the PC's and principal axes are useful for
plotting does not necessarily make them useful for
interpretation. One usually does better by relating the g-points
to the h's of the original variables rather than to the
axes for the PC's.

A great many applications of PC analysis have been
made through the years. As a method of approximation in
lower dimensional space, this is fine, but as an
interpretative device its popularity is surprising. What the
method does is to express the original variables in terms
of PCs in the form

$$y_{(v)} = \lambda_1 p_1 q_{1,v} + \lambda_2 p_2 q_{2,v} \qquad (3.10)$$

as follows by taking a column of equation (3.8). In
view of (3.3), (3.4), it is clear that the method also allows
the PC's to be expressed in terms of the variables as

$$\lambda_\alpha p_\alpha = \sum_v y_{(v)} q_{\alpha,v}. \qquad (3.11)$$

(Note that the weights in both linear combinations are the
q's obtained by solving (2.3): they are therefore referred
to as loadings.) But does this provide any insight? Does
(3.10) "explain" the variables by the PC's or does (3.11)
"explain" the PC's by the variables? Or do we expect, by
circular reasoning, to have both "explanations"?

At  best,  consideration  of  the  loadings  q  gives  some 
insight  into  which  variables  are  correlated  with  what  others 
(similar  loadings  on  the  first  few  PCs) .  What  is 
puzzling  are  the  attempts  of  many  users  of  PC's  to  "reify" 
these  mathematical  constructs  and  ascribe  "inherent," 
"underlying"  or  "explanatory"  content  to  them.  It  is  not 
evident  how  any  such  content  follows  from  the  mathematical 
definition  of  PC's  and  one  may  suspect  that  much  of 
what  has  been  published  as  PC  analyses  may  have  obscured 
rather  than  illuminated  the  configuration  of  the  original 
variables  which  should  have  been  studied. 

The loadings of the monthly temperatures in the six
PC's are shown in Table 12. The uniformly high PC1
loadings for all months show PC1 to be some average annual
temperature factor which puts more emphasis on Fall-Winter,
as is evident from the biplot in which all months' h-arrows
point left, partly above and partly below the horizontal
direction, but the Fall-Winter h's are closer to horizontal
than the Spring-Summer h's. The PC2 loadings are positive
for Spring-Summer, negative for Fall-Winter, and thus indicate
a seasonal component; again, this was evident from the
biplot configuration. Loadings on PC3, PC4, PC5 and PC6
are not so easily interpretable, though the joint
consideration of PC2, PC3 and PC4 in a bistructure could be
interpreted and modelled in terms of an annual cycle. Nothing
is revealed by consideration of these axes that was not
seen by inspection of the h's themselves. The h-configuration
is much more simply described by two sheaves, one for
each half-year, than by two axes, one a weighted annual
average, the other a weighted contrast.

TABLE 12

Temperature Data - PC Loadings $q_{\alpha,v}$

                              PC α
Variable   v       1      2      3      4      5      6

JAN        1     .452  -.540  -.405  -.561  -.158  -.005
MAR        2     .436  -.199  -.224   .700   .028  -.479
MAY        3     .370   .272  -.350   .258   .088   .770
JUL        4     .293   .601  -.182  -.349   .505  -.377
SEP        5     .374   .432   .247  -.073  -.772  -.107
NOV        6     .492  -.218   .754  -.034   .339   .156
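
The loadings of Table 12 can be checked for internal consistency. The sketch below assumes, as the eigen-equations imply, that each column of loadings has unit length and that different columns are orthogonal; the values are copied from Table 12 and the helper names are illustrative only.

```python
MONTHS = ["JAN", "MAR", "MAY", "JUL", "SEP", "NOV"]

# Rows follow MONTHS; columns are PC1..PC6 (values copied from Table 12).
Q = [
    [0.452, -0.540, -0.405, -0.561, -0.158, -0.005],
    [0.436, -0.199, -0.224,  0.700,  0.028, -0.479],
    [0.370,  0.272, -0.350,  0.258,  0.088,  0.770],
    [0.293,  0.601, -0.182, -0.349,  0.505, -0.377],
    [0.374,  0.432,  0.247, -0.073, -0.772, -0.107],
    [0.492, -0.218,  0.754, -0.034,  0.339,  0.156],
]

def column(a):
    """Loadings of all six months on PC number a + 1."""
    return [row[a] for row in Q]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

# Each column should have squared length close to 1 (up to three-decimal rounding).
squared_lengths = [dot(column(a), column(a)) for a in range(6)]

# Different columns should be nearly orthogonal; PC1 against PC2 is checked here.
pc1_dot_pc2 = dot(column(0), column(1))

# Sign pattern of the PC2 loadings, the seasonal contrast discussed in the text.
pc2_signs = {m: ("+" if q > 0 else "-") for m, q in zip(MONTHS, column(1))}

print([round(s, 3) for s in squared_lengths], round(pc1_dot_pc2, 3), pc2_signs)
```

The sign pattern printed last is the Spring-Summer versus Fall-Winter contrast read off the PC2 column; the check adds nothing to the interpretation but guards against transcription errors in the table.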

This  example  illustrates  the  shortcomings  of  PC 
analysis  in  considering  each  principal  axis  separately 
and  the  advantage  of  seeing  the  overall  picture  in  a  biplot 
or  bimodel.  It  also  illustrates  the  limitations  of  using 
orthogonal  axes  fitted  by  least  squares  —  these  do  not 
necessarily  provide  the  most  readily  interpretable  references. 


PC's  depend  on  the  scale  of  measurement  of  the  original 
variables.  This  dependence  is  obvious  from  the  definitions 
which  depend  on  covariances,  variances  and  least  squares 
fits.  Much  has  been  written  on  this  dependence  and  how 
it  limits  the  usefulness  of  PC  analysis.  All  this  seems 
beside  the  point:  There  is  no  real  reason  to  consider  PC 
analysis  as  a  method  for  revealing  the  "underlying"  structure 
or to regard PC's as "intrinsic" variables and hence
it also does not matter that these "structures" and "intrinsic
variables" are not scale independent.

Finally, PC's are often reified by reference to other
variables extraneous to the original set. In the temperature
illustration, it was not difficult to label PC1 and PC2,
though no simple interpretation was evident for other PC's.
Another illustration is a recent study of rainfall where
PC4 was noted to have a clear time trend associated with
the spread of irrigation. It was therefore suggested that
PC4 was an "irrigation factor" and the possibility of
using it as such a variable was considered. Direct use
of the irrigation data, with which PC4 had been found to
be correlated, would have been simpler and more
straightforward and would not have required PC analysis at all.

This was a pretty typical example of the "use" of PC's
and the rationale of many other uses of PC analysis is
equally puzzling.
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To  summarize,  this  author's  view  is  that  PCs  are 
unlikely to have explanatory value in themselves. The
most physically meaningful LCVs will not usually happen to
lie along the principal axes of the configuration. This
author sees the main usefulness of PCs as a tool to provide
least squares approximations to data matrices and variables'
configurations and he would direct the scientific attention
of investigators to what the approximation tells them about
the original variables, and not to what it shows about the
PCs. The investigator must have included his variables
because he wants to know something about them, so let him
discuss them instead of substituting mathematical artifacts.
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3.6.  The  rank  or  dimension  of  a  configuration 

When a number m of variables are observed on more than m
units, the configuration may be completely in a sub-space of
m-1, m-2 or fewer dimensions, but this is very unusual.
It indicates exact linear relations between the variables,
even though these variables must be affected by random
variability and measurement error. When such things are
actually observed, one usually finds that the original set
of variables includes some repetitions of observations or
sums, or averages, of other variables which are also included.
It is rare and surprising to find such exact dependence
otherwise. A set of m variables observed on n (>m) units
almost invariably generates a configuration in m-space that
cannot fit exactly into any lower dimensional subspace. It
may well be approximated in a lower space, and perhaps even
very closely approximated, but it is very unlikely to fit
exactly.
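
The point can be illustrated with a small made-up example. When a third variable is the exact sum of two others the matrix $Y'Y$ is singular, so the configuration lies in a plane; a little "measurement error" in the third variable is enough to remove the exact dependence. The data and the function name are hypothetical, and integer values are used so that the determinant is computed exactly.

```python
def gram_determinant_3(Y):
    """Determinant of the 3 x 3 matrix Y'Y for a data matrix with three columns."""
    n = len(Y)
    S = [[sum(Y[i][a] * Y[i][b] for i in range(n)) for b in range(3)] for a in range(3)]
    return (S[0][0] * (S[1][1] * S[2][2] - S[1][2] * S[2][1])
            - S[0][1] * (S[1][0] * S[2][2] - S[1][2] * S[2][0])
            + S[0][2] * (S[1][0] * S[2][1] - S[1][1] * S[2][0]))

# Made-up centered observations on two variables for five units.
base = [(-2, 1), (-1, -2), (0, 3), (1, -1), (2, -1)]

# Third variable is the exact sum of the first two: the configuration lies in a plane.
with_sum = [[a, b, a + b] for a, b in base]

# Third variable is the same sum disturbed by made-up "measurement error".
errors = [1, -1, 0, 1, -1]
with_error = [[a, b, a + b + e] for (a, b), e in zip(base, errors)]

det_exact = gram_determinant_3(with_sum)
det_noisy = gram_determinant_3(with_error)
print(det_exact, det_noisy)
```

A zero determinant signals the exact dependence; a positive one, however small, means the configuration genuinely occupies all three dimensions and can only be approximated in a plane.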

This suggests that the question of dimensionality
rarely relates to the true configuration of variables but
usually makes sense only in the context of approximation.
"Hypotheses" of reduced dimensionality are, in this author's
opinion, rank nonsense, and the techniques of statistical
significance are only rarely relevant to problems of
dimensionality. "Tests of significance" of PC's will therefore
not be discussed here. If the hypothesis that a 6-variate
configuration is in a plane is physically incredible, it
makes no sense to test it for significance, i.e., to test
the nullity of PC3, PC4, etc. Hypothesis testing makes
sense only if the hypotheses can be given credence.

This issue of dimensionality should correctly be
addressed as one of approximation and not of hypothesis
testing. The biplot plane fitted the 6-variable temperature
data to a goodness of fit of 0.967 and the 3-D bimodel had
a 0.985 fit. That may well justify ignoring the remaining
dimensions to all intents and purposes even if one is
certain that there is some variability along those axes.
There is no question of "testing" whether the data are
in a plane or in 3-D; the practical issue is simply that
the fraction of real variation that lies outside the plane
or 3-D is negligible and need not be considered in
interpreting the data - even though it is not assumed to be
strictly null.
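
A goodness of fit figure of this kind can be sketched as the share of the total sum of squares carried by the first few roots. This assumes the measure is the ratio of the sum of the leading $\lambda_\alpha^2$ to the sum of all of them, which is the usual definition but is not restated in this section. The roots below are hypothetical and are not those of the temperature data.

```python
def goodness_of_fit(squared_roots, rank):
    """Share of the total sum of squares captured by the first `rank` axes."""
    ordered = sorted(squared_roots, reverse=True)
    return sum(ordered[:rank]) / sum(ordered)

# Hypothetical squared roots for a 6-variable configuration (not the chapter's values).
lam2 = [60.0, 30.0, 5.0, 3.0, 1.5, 0.5]

fit_plane = goodness_of_fit(lam2, 2)   # planar approximation (biplot)
fit_3d = goodness_of_fit(lam2, 3)      # three dimensional approximation (bimodel)
print(fit_plane, fit_3d)
```

The question asked of such figures is whether the remainder is small enough to ignore for the purpose at hand, not whether it is "significantly" different from zero.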
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3. 7.  The  factor  analytic  model 

Factor analysts postulate a model in which each variable
can be written in the form

$$y_{(v)} = \sum_{\alpha=1}^{r} f_{(\alpha)} \, l_{\alpha,v} + e_{(v)}, \qquad (3.12)$$

as the sum of a linear combination of a few, say r, "factor"
variables $f_{(1)}, \ldots, f_{(r)}$ with "loadings" $l_{\alpha,v}$ and errors $e_{(v)}$
specific to each variable and uncorrelated either with the
factors or with one another.

When  the  rank  r  is  sufficiently  large  compared  to  m, 
this  model  has  an  exact  solution.  In  fact  it  then  usually 
has  infinitely  many  solutions.  However,  factor  analysts 
usually  postulate  r  to  be  quite  small  relative  to  m  (so  as 
to  obtain  parsimony  in  description) ,  and  then  the  model 
is most unlikely to fit exactly. A satisfactory fit may at
times be obtained if the data are considered as a random
sample from a population in which the e's are uncorrelated
between themselves and with the f's.

What factor analysts do in practice is to approximate
the data by a model of type (3.12) with rank as low as
will allow a reasonable fit. When the purpose of a factor
analysis is avowedly approximative the criteria for the
method would be goodness of fit and parsimony. If we
compare an approximation by model (3.12) with a PC
approximation of the same rank, it is clear that the factor analytic
model is very much less parsimonious in that it requires the
e terms to be uncorrelated. The fit of its factor part, the
sum in (3.12), is necessarily worse than that of the first r
PCs because the latter are required only to give the least
squares fit.

However,  one  variant  of  Factor  Analysis  sets  out 
directly  to  approximate  the  correlations.  Unlike  PC  analysis 
it  does  not  approximate  the  data  matrix  Y,  nor  does  it 
approximate  the  diagonal  elements  of  R  as  these  are  known  to 
equal  unity.  This  particular  variant  of  factor  analysis  — 
called  MINRES  —  is  justified  directly  in  terms  of  optimal 
approximation  of  the  correlations  -  the  off-diagonal  elements 
of  R. 

As  in  the  case  of  principal  components,  one  has  to  ask 
whether  the  model  as  such  makes  physical  sense  so  that  the 
factors  are  "intrinsic"  variables,  or  whether  it  serves  as  a 
mere  vehicle  of  parsimonious  approximation.  Our  answer  to  the 
first question should be similar to the one we have given for
PC analysis: We see no a priori reason to think the "factors"
fitted in model (3.12) are any more "real" than the PCs.

The  factor  analytic  model  is  no  more  plausible  than  the 
hypothesis  of  lower  dimensionality  which  we  discussed  in 
connection  with  PC  analysis.  However,  its  saving  grace  is 
that  it  is  so  flexibly  defined  as  to  allow  considerable 
manipulation  which  can  on  occasion  be  used  to  advantage. 
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Thus,  for  given  rank  r,  model  (3.12)  becomes 

Y  -  E  =  F  L,  (3.13) 


with obvious definitions for matrices E, F, and L, and the
flexibility of this representation is that not only is E
not uniquely determined, but F and L can be changed into

$$F^* = F Q \qquad (3.14)$$

and

$$L^* = Q^{-1} L \qquad (3.15)$$

by any non-singular r × r matrix Q. Factor analysts spend
much ingenuity in rotating their original F, L solution
into a solution F*, L* that "makes sense" so that the
resulting f* "factors" have some reality and are useful
in interpreting data.
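
The indeterminacy expressed by (3.14) and (3.15) is easily verified numerically: the fitted part F L is unchanged when F is post-multiplied by Q and L is pre-multiplied by the inverse of Q. The sketch below uses r = 2 with made-up matrices; the helper functions are written out only to keep the example self-contained.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inverse_2x2(Q):
    (a, b), (c, d) = Q
    det = a * d - b * c
    if det == 0:
        raise ValueError("Q must be non-singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# Made-up factor scores F (4 units x 2 factors) and loadings L (2 factors x 3 variables).
F = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.0, -2.0]]
L = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]]
Q = [[1.0, 1.0], [0.0, 2.0]]             # any non-singular 2 x 2 matrix will do

F_star = matmul(F, Q)                    # (3.14)
L_star = matmul(inverse_2x2(Q), L)       # (3.15)

fitted = matmul(F, L)
fitted_star = matmul(F_star, L_star)
print(fitted == fitted_star)
```

Since every non-singular Q gives the same fit, the data cannot choose between the alternative sets of "factors"; the choice rests entirely on criteria the analyst brings from outside.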

In  some  cases  these  rotations  are  chosen  so  as  to 
yield  factors  correlated  with  extraneous  variables  or  other 
information  available  to  the  investigator.  It  is  difficult 
to see what "explanatory" function such a procedure has.
The investigator had the "explanation" or extraneous variable
anyway, and he could have correlated the original variables
with it. Why bother to use factor analysis? Why not just
take the multiple regression of the extraneous variable on
the original variables as the "factor"?

Some  methods  of  rotation  such  as  varimax  and  other 
computerized  techniques  are  built  so  as  to  make  individual 
f*'s  as  closely  representative  of  sheaves  of  variables  as 
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possible.  That  brings  us  back  to  the  subject  of  applying 
clustering  techniques  to  variables'  configurations,  an 
approach  whose  careful  use  may  well  yield  important  data 
analytic  insight. 

Again,  it  is  difficult  to  see  what  the  role  of  the  factor 
analytic  model  is  in  all  this.  If  one  seeks  correlation 
with  extraneous  information  one  can  best  do  it  directly, 
on  the  data  rather  than  on  the  "factor  solution."  If  one 
wants  to  organize  the  variables  into  sheaves,  it  is  not 
obvious  that  one  had  best  start  from  a  set  of  fitted  loadings  - 
but  it  may  be  legitimate  to  do  so.  It  is  essential  to 
understand  that  in  all  these  applications  there  seems  no 
essential  role  to  the  "factor  analytic  model."  This  model 
has  neither  reality  nor  much  usefulness,  except  under  certain 
circumstances,  as  an  approximating  device. 

Our  view  of  factor  analysis  differs  sharply  from  that 
of  most  practitioners  of  these  techniques  who  talk  about  their 
model  as  though  it  had  inherent  reality.  Even  when  they  use 
a  clearly  approximative  technique  such  as  MINRES  they  try 
to  reify  the  resulting  factors.  Indeed,  the  MINRES  solution 
would  sometimes  involve  imaginary  numbers  (Gabriel,  1978) 
but  factor  analysts  shy  away  from  such  an  optimal  approximation 
because  they  believe  in  the  "reality"  of  their  factors. 
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4.  ANALYZING  THE  SCATTER  OF  THE  UNITS 

4.1.  A  batch  of  units  and  its  scatter 

The  units  whose  observations  make  up  the  rows  of  data 
matrix  Z  usually  have  identities  of  their  own  --  "labels" 
as  the  sampling  theorists  call  them  —  and  these  identities 
may  be  relevant  to  the  analysis  of  the  data.  Some  relations 
between  units  may  be  given  a  priori  and  it  may  be  of  interest 
to  study  if  and  how  they  are  associated  with  statistical 
similarity of the corresponding rows of Z. A priori
groupings of the units in terms of information extraneous to
data matrix Z could be related to the statistical scatter
and to similarities of the corresponding $z_i$'s. Data analysis
is often concerned as much with the units as with the
variables. In our example it is certainly as legitimate, and
interesting,  to  study  the  scatter  of  stations  as  it  is  to 
study  the  variance  configuration  of  months. 

In  modern  statistics  books  this  subject  is  hardly 
dealt  with  at  all,  and  the  idea  of  between  units  distance 
barely  receives  mention.  This  is  because  the  fashion  has 
been  to  deal  exclusively  with  inference  based  on  random 
samples  from  a  population  or  distribution.  And  in  that 
context  the  units  of  observation  lose  their  individuality 
and  become  mere  replications  in  a  sampling  process. 
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This  sampling  approach  is  undoubtedly  appropriate 

in  many  experimental  situations  and  in  situations  like 

industrial  quality  control  where  repeat  observations  are 

carried  out  regularly.  But  it  is  not  appropriate  to  the 

study  of  batches  of  units  with  well  defined  identities 

and  labels.  Ignoring  the  information  associated  with 

these  identities  may  stultify  the  analysis  of  such  data. 

In this section we consider, therefore, methods of
analyzing the units' scatter and we choose to do so in terms
of standardized distances $\sqrt{d_{i,i'}}$ between pairs of units
(1.13) and the corresponding standardized distances between
units and the centroid (1.12). We will find it convenient to
consider the scatter also in terms of the biplot
approximations $\| g_i - g_{i'} \|$ and $\| g_i \|$ - (2.10) and (2.11) -
of the above distances.


4.2.  Use  of  extraneous  information  on  the  units 

When extraneous information is available, i.e.,
information on unit i other than the observation $z_i'$, this
information may be correlated with the observations. When
the units fall into a number of categories, one may check
whether these categories are associated with the statistical
scatter of points. Do the categories form distinct
groupings in m-space and/or on the biplot? Is there much or
little overlap between categories?

A  simple  device  is  to  mark  the  units  of  each  category 
by  a  different  mark,  or  color,  on  the  biplot  and  see  if  the 
categories  do  separate.  Figure  6  shows  the  g-points  of  the 
temperatures  biplot  (Figure  2)  classified  according  to 
whether  they  are  North  or  South  of  the  equator.  A  clear 
separation  is  evident,  showing  that  the  temperature  profiles 
of  Northern  hemisphere  stations  differ  from  those  in  the 
Southern  hemisphere.  The  former  are  at  the  top  of  the 
biplot,  the  latter  at  the  bottom.  Recalling  that  the 
vertical  direction  on  the  biplot  was  a  contrast  between 
Spring-Summer  and  Fall-Winter  (Section  2.5,  above)  one  sees 
that  the  Northern  versus  Southern  hemisphere  groupings 
reflect  the  difference  in  the  season  in  which  their  maximum 
temperatures  occur. 

Figure 6: Biplot of Temperature Data With Indication of
Hemisphere of Station (N = Northern; 0 = Within
2 Degrees of Equator; S = Southern)

When the extraneous information is not categorical but
rather of a continuously variable character, the methods of
analysis are less obvious. A good idea is to record the
extraneous measurements on the g-points of the biplot and
see if they show some regularity in the plane. Thus, in
the temperature example we have marked the altitude on each
g-point in Figure 7 and we see at once that the distribution
on the biplot shows much regularity.

The  leftmost  g-points  are  those  of  stations  at  high  elevations  - 
evidently  there  is  some  right  to  left  trend  in  altitude.  As 
this  trend  is  in  direction  opposite  to  the  general  direction 
of  the  h-arrows  for  months'  temperatures,  one  would  conclude 
(again  unsurprisingly)  that  temperatures  are  rather  lower 
at  higher  elevations. 

Perhaps  some  additional  comments  on  this  example 
would  further  illustrate  uses  of  the  biplot.  The  North-South 
differences  of  Figure  6  and  the  altitude  differences  of 
Figure  7  account  for  a  great  deal  of  the  variability  of  the 
stations.  However,  a  number  of  stations  do  not  quite  fit  the 
pattern,  especially  stations  2  and  10  which  are  much  farther  to 
the  right  of  the  biplot  than  one  would  expect  from  their 
altitudes.  Checking  their  locations  on  Figure  2  one  sees  these 
stations  to  be  far  inland  on  the  South  American  continent. 
Evidently,  in  addition  to  altitude  and  to  Northern  versus 
Southern  latitude,  distance  inland  also  plays  a  role  in 
determining  temperatures. 

This  illustration  shows  how  the  biplot  can  be  used  to  check 
hunches  about  relationships  to  extraneous  variables  and  how 
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inspection  of  the  biplot  may  suggest  new  things  to  look  for. 
Of course, these are subjective impressions and their effectiveness
depends very much on the ideas the investigator may be able
to generate. The biplot will help him; it will not provide an
objective substitute for his intuition.
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4.3.  Clustering  of  units 

Some groupings of units may be evident from the inter-unit
distances themselves, rather than from extraneous
information. Such data-dependent groups will be referred
to as clusters, and methods of defining such groupings will
be referred to as clustering algorithms. They differ
from those used for locating sheaves of variables in that
they relate to units rather than to variables and that the
criterion for clustering is small inter-unit distances,
whereas the criterion for forming sheaves was high
intercorrelation of variables.

Many methods of clustering are available. A very
simple one uses single linkage. To begin with, one clusters
the nearest two points together. At the second stage one
considers the next smallest distance: if it is between one
of the first two points and a third point, one clusters all
three points together; if it is between two other points,
one forms a second cluster of those two points. At each
successive stage one considers the smallest of the distances
between points which are not already in the same
cluster. The points separated by this least distance are
then linked together, and with them any other points clustered
previously with either of them. Thus, if at a particular stage
units 8 and 15 have least distance 0.28, and in previous
stages unit 8 had been clustered with units 2, 6, 9, 10 and
13 whereas unit 15 had been clustered with units 7 and 14,
then the new cluster consists of units 2, 6, 7, 8, 9, 10,
13, 14 and 15.

One may thus proceed step by step up to the largest
distance between points, at which stage all units become
a single cluster. In practice one will presumably want to
stop the clustering process before that, either when the
number of clusters is small enough or when the remaining
distances are too large.
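
The single linkage procedure just described can be sketched in a few
lines of Python. This is only an illustrative sketch: the function name
single_linkage, the unit labels and the one-dimensional positions are
hypothetical and are not taken from the temperature data.

```python
from itertools import combinations


def single_linkage(dist, stop_distance):
    """Link units in order of increasing distance (single linkage).

    dist maps frozenset({a, b}) to the distance between units a and b.
    Linking stops once the next smallest distance exceeds stop_distance.
    Returns a list of clusters, each a set of unit labels.
    """
    units = set()
    for pair in dist:
        units |= pair
    cluster_of = {u: {u} for u in units}  # every unit starts alone
    for pair, d in sorted(dist.items(), key=lambda item: item[1]):
        if d > stop_distance:
            break
        a, b = tuple(pair)
        if cluster_of[a] is cluster_of[b]:
            continue  # the two units were already linked earlier
        merged = cluster_of[a] | cluster_of[b]
        for u in merged:
            cluster_of[u] = merged
    clusters = []
    for c in cluster_of.values():
        if c not in clusters:
            clusters.append(c)
    return clusters


# Hypothetical units placed on a line, for illustration only.
positions = {"A": 0.0, "B": 1.0, "C": 2.2, "D": 6.0, "E": 6.5}
dist = {
    frozenset((a, b)): abs(positions[a] - positions[b])
    for a, b in combinations(positions, 2)
}
clusters = single_linkage(dist, stop_distance=1.5)
print(sorted(sorted(c) for c in clusters))
```

Raising stop_distance far enough merges everything into a single
cluster, which corresponds to the top of the tree.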

The entire clustering process can be displayed by a
dendrogram, which is an inverted tree-like structure with a
vertical scale corresponding to distance. This dendrogram
has a single stem on top at the height of the largest
distance, when all units are clustered together. At the
bottom of the dendrogram, below the height of the least
distance, it has n separate branches, one for each unit.
In between, at the height of each distance, it has as
many branches as there are clusters at that distance.
Below that height each branch divides into further branches
and sub-branches until the individual units' branches are reached.

The nearest neighbor dendrogram for the 20 stations
of the temperatures example, corresponding to the
standardized biplot distances in Table 9, is given in
Figure 8.

Figure 8: Temperature data - single linkage dendrogram
of biplot distances

(For convenience, the order of the points has been
rearranged to correspond as closely as possible to that of
the biplot; in practical application this is of course
not possible since the "true" order is not known.)

It will be seen that the dendrogram of Figure 8
reproduces only some of the clusters evident from the biplot
g-point scatter of Figure 2. Thus if we divide the points
into four clusters by lopping off the branches between
heights 1.0 and 1.1, one cluster is of stations 4, 5 and
11, two are of the single stations 19 and 20, and one is of the
remaining fifteen stations. This is not very satisfactory
because the last cluster is too large and spread out: the
largest intra-cluster distance is 2.9 between stations 8 and
17. Such an elongated "cluster" is obtained because there is a
"chain" of points at relatively small (below 1.0) distances
from point 3 to point 17, i.e., 3 to 1 (.99), 1 to 6 (.74),
6 to 15 (.62), 15 to 16 (.41) and 16 to 17 (.89).


An alternative clustering criterion which would avoid
such "elongated" clusters uses complete linkage and
clusters a set of units together, at distance d_0 and above,
only if all units within the set are within d_0 of one
another. The corresponding dendrogram for the temperature
data is shown in Figure 9. It is seen to differ from Figure
8 not only in that it shows less clustering for each given
distance, but also in that it results in somewhat different
clusters.

Thus, to obtain four distinct clusters, one would lop
off the branches at d_0 = 2 and the resulting clusters would
be 5, 4, 11 (as before); 19, 20 (which had been separated
before); 1, 3, 12; and the remaining twelve stations (these
last two clusters formed a single cluster by the previous
method). The separation of that elongated cluster into
two tighter clusters seems more satisfactory.
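
A minimal sketch of the complete linkage criterion follows. It assumes
a simple agglomerative scheme (repeatedly merge the two clusters whose
farthest pair of members is closest, as long as that distance does not
exceed d0); the function name, labels and positions are hypothetical.
The five points form a "chain" with neighbouring distances of 1, which
single linkage at threshold 1 would join into one elongated cluster.

```python
from itertools import combinations


def complete_linkage(positions, d0):
    """Cluster so that all members of a cluster are within d0 of one another."""
    clusters = [[name] for name in positions]

    def farthest_pair(c1, c2):
        # Complete linkage distance: the largest distance between a
        # member of c1 and a member of c2.
        return max(abs(positions[a] - positions[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: farthest_pair(clusters[ij[0]], clusters[ij[1]]),
        )
        if farthest_pair(clusters[i], clusters[j]) > d0:
            break  # no further merge keeps every pair within d0
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical chain of units on a line, one unit of distance apart.
chain = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0, "e": 4.0}
tight_clusters = complete_linkage(chain, d0=1.0)
print(tight_clusters)
```

Under these assumptions the chain is broken into small groups whose
members are all mutually close, instead of one elongated cluster.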

Figure 9: Temperature data - complete linkage dendrogram
of biplot distances

In addition to these two clustering criteria, there are
many  others  in  the  literature  with  algorithms  and  computer 
programs  to  carry  them  out.  Each  of  the  methods  is  supposedly 
"objective,"  but  the  choice  of  a  method  is  a  subjective 
matter,  and  each  investigator  must  make  sure  he  is  using  a 
method  whose  criterion  is  meaningful  to  him  and  appropriate 
to  his  purposes. 

When the units cluster "naturally" into distinct tight
groups, pretty much all clustering algorithms will reproduce
that pattern. Often, however, the scatter does not reveal
such obviously distinct clusters and then different
algorithms will output different "clusters." In such
cases one would be justified in using some "objective"
method only if one were really satisfied with the relevance
of its criterion. Otherwise, analysis into "clusters" becomes
a game, especially if one tries out a variety of algorithms
and then picks out one of them. Indeed, the multiplicity
of available algorithms pretty much guarantees that any
random scatter will "cluster" nicely by some one of the
many criteria. The investigator should be cautioned to
inspect the clustering criterion carefully before he commits
his data to an "objective" analysis into clusters.

The virtue of objectivity in data analysis is not
obvious. A subjective approach which allows some capable
researchers to obtain insights is certainly preferable to an
objective method which usually fails to reveal anything
worthwhile to any investigator. One should not carry democracy too
far. In the analysis of scatters of units (as well as that
of configurations of variables), the capable investigator will
usually approach his data with a great deal of prior knowledge,
hunches and hypotheses about patterns and relationships. He
will do well to be guided by them and direct his analysis
accordingly. If he wishes to cluster his data, he should not
do so "objectively," merely on the basis of distances (or correlations),
but should allow the interplay of observation with prior hypothesis.
Specifically, if unit U1 is about as distant from unit U2 as from
unit U3, the investigator would do well to group it with the
unit to which he has a priori reason to expect it to be more
closely related.

Clustering  algorithms  are  popular  not  only  because  they 
are  "objective"  (after  subjective  choice  of  the  algorithm) 
but  because  they  can  deal  with  large  scatters  (or 
configurations)  and  have  been  programmed.  It  is  very  difficult 
to  inspect  large  data  matrices  by  eye,  though  use  of  prior 
ideas  about  possible  patterns  may  be  of  great  help.  (See 
for  example  Guttman's  use  of  linear  and  circular  dependence 
patterns  —  the  simplex  and  radex  (Guttman,  1954)  — 
for  meaningful  inspection  of  correlation  matrices.) 

4.4. Outliers

Distances $d_{i,e}$ are standardized by definition
(Section 1.3, above). As a result, the scatter of points
in m-space is spherically symmetric, and its approximation
on the biplot is essentially circular. Unlike the well-known
elliptic forms of variability of raw variables, their
standardized representation is circular.

In studying the form of the distribution, therefore, we
should not look for asymmetry, which has been eliminated
by standardization, but rather for other features such as
clumping or clustering of points, special patterns,
associations with external variables, outliers, etc.

To begin with, note that the sum of squares of standardized
distances is fixed by standardization. Thus,

    $\sum_{i=1}^{n} u_{i,i} = m$                              (4.1)

so that the average of the squared distances from the
centroid is

    $\bar{u} = m/n$.                                          (4.2)

Also, by the triangular inequality for distances one
obtains

    $d_{i,e} \le \sqrt{u_{i,i}} + \sqrt{u_{e,e}}$.            (4.3)

For the biplot approximations these correspond to

    $\sum_{i=1}^{n} \|g_i\|^2 = m$                            (4.4)

and

    $\|g_i - g_e\| \le \|g_i\| + \|g_e\|$.                    (4.5)

If the scatter is roughly evenly distributed
within radius unity, there is little to be said. If, however,
one notes isolated points in one direction, with the remaining
g's tightly bunched in the opposite direction, one should
inspect the outlying points carefully for measurement or
recording errors or perhaps for not belonging to the
population under study. If so, one might do well to omit
such units from analysis and concentrate on the units that
have a reasonable statistical scatter. This would, of
course, mean recalculating the principal axes and g and h
vectors after omission of the outliers.

A reasonable criterion for multivariate outliers is
the distance $\sqrt{u_{i,i}}$ from the centroid. One does well
to look at the distribution of these n distances and see
if it indicates some clearly outlying units. Tests of
significance are available for the multi-normal case
(Gnanadesikan, 1977) but we feel that these should be used
with great caution unless one really has good reason to
believe that the data came from such a distribution.
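
The following sketch illustrates (4.1) and the outlier criterion for a
bivariate batch (m = 2). It assumes that the squared standardized
distance of unit i from the centroid is u = y'(Y'Y)^{-1}y, where y is
the centered observation and Y'Y is the sum-of-products matrix of the
centered data; under that scaling the u's add up to m. The data and the
function name are hypothetical.

```python
import math


def squared_standardized_distances(rows):
    """Squared standardized distances from the centroid for 2-variate rows."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    centered = [(r[0] - mean_x, r[1] - mean_y) for r in rows]
    # Sum-of-products matrix [[sxx, sxy], [sxy, syy]] of the centered data.
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    det = sxx * syy - sxy * sxy
    # Quadratic form with the inverse of that 2x2 matrix.
    return [
        (syy * x * x - 2.0 * sxy * x * y + sxx * y * y) / det
        for x, y in centered
    ]


# Hypothetical batch; the last unit is deliberately far from the rest.
batch = [(1, 2), (2, 1), (3, 3), (2, 2), (1, 1), (3, 2), (2, 3), (9, -4)]
u = squared_standardized_distances(batch)
distances = [math.sqrt(value) for value in u]
total = sum(u)                               # should equal m = 2
most_outlying = distances.index(max(distances))
print(round(total, 6), most_outlying)
```

Listing the n distances in sorted order is then a quick way of seeing
whether one or two units stand clearly apart from the rest.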

It often happens that one finds one or more "outliers"
but  checks  do  not  reveal  any  reason  why  those  observations 
should  be  unusual.  So  one  does  not  know  whether  these  are 
extreme  values  which  do  occur  sometimes,  though  rarely, 
in  the  given  observational  situation,  or  whether  these  are 
erroneous  records  which  do  not  belong  with  the  batch  under 
study.  One  is  in  a  quandary  as  to  whether  to  "reject"  such 
outliers  or  not.  Not  to  reject  means  including  observations 
that  manifestly  do  not  fit  the  statistical  distribution  of 
the  majority  and  vitiates  the  assumptions  underlying  most 
statistical  procedures.  To  reject  exposes  one  to  risks  of 
biasing  the  statistical  analysis  if  the  outliers  were 
extremes  from  the  same  distribution  as  the  rest  of  the 
observations.  An  honest  rule  would  be  always  to  report  at 
least  the  number  of  rejected  outliers  and  preferably  their 
entire  observations,  but  not  to  include  them  in  the  main 
statistical  analysis.  Such  rules  and  considerations  apply 
as  much  to  multivariate  as  to  univariate  data. 

4.5. The distribution of points in the g scatter

Some idea of the distribution of the batch over the
variables may be obtained from considering the scatter of
g-points on the biplot. A more or less regular unimodal
distribution should result in a reasonably symmetrical
biplot scatter with a concentration of points about the
centroid and gradual tapering off of density towards the edges.

Some  other  distributions  obviously  have  different 
biplot  scatters.  A  common  case  is  that  of  multivariate  J- 
shaped  distributions  which  have  a  mode  near  zero  for  all 
variables  and  a  density  which  decreases  for  higher  values 
of  each  variable.  Such  distributions  will  produce  biplots 
of  the  type  illustrated  in  Figure  10  —  essentially  a 
quadrant  of  points  with  a  high  concentration  at  the  vertex 
and  along  the  edges  and  with  h-arrows  pointing  in  the 
direction  opposite  to  the  vertex.  The  vertex  represents 
the  zero  point  of  all  variables,  the  edges  the  zeroes 
of  particular  variables. 

To check such distributions it is useful to project
individual numerical vectors, in particular the zero
vector, onto the biplot scatter. For the zero vector $z_0 = 0$,
the projection is

    $g_0' = -\bar{z}' F$                                      (4.6)

as in (2.12). The rough location of the zero vector is
indicated on Figure 10 and confirms the supposition that
these data are of a multivariate J-shaped distribution.


[Figure 10: hypothetical data]

When the g-scatter does not show any special pattern
one  must  consider  the  distribution  to  be  essentially  random. 
It  is  not  that  regularities  may  not  exist,  but  that  they 
are  not  evident  from  the  scatter.  We  know  of  no  way  of 
"testing"  for  normality,  and  would  rather  tend  to  use 
the  normal  model  by  default,  as  a  viable  model  in  case 
nothing  contrary  emerges  from  biplot  inspection. 

In  effect,  the  biplot  may  provide  a  more  sensitive 
check  of  multivariate  normality  than  the  commonly  available 
tests  of  significance  which  concentrate  on  each  individual 
variable  rather  than  on  a  plane  as  the  biplot  does.  But, 
as  of  now  we  have  no  way  of  using  the  biplot  plane  for  a 
test  of  significance  on  the  shape  of  distribution. 

Another  intriguing  issue  is  whether  the  biplot  might 
be  suggestive  of  a  transformation  to  normality.  All  we 
can  say  at  this  time  is  that  strongly  skewed  distributions 
should  show  up  in  a  biplot  looking  like  that  of  Figure  10. 
Hence,  the  appearance  of  such  a  scatter  might  be  suggestive 
of  a  transformation  by  square  roots,  logarithms  or  similar 
functions.  This  subject  needs  further  examination. 
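
As a rough numerical companion to this remark, the sketch below
computes a moment-based skewness coefficient for a hypothetical
J-shaped sample before and after square-root and logarithmic
transformation. The sample and the function name are illustrative
assumptions only; nothing here is taken from the data of this Chapter.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positive sample with a mode near zero and a long right tail.
raw = [1, 1, 2, 2, 3, 4, 6, 10, 20, 50]
skew_raw = skewness(raw)
skew_sqrt = skewness([math.sqrt(v) for v in raw])
skew_log = skewness([math.log(v) for v in raw])
print(round(skew_raw, 2), round(skew_sqrt, 2), round(skew_log, 2))
```

For this sample the square root reduces the skewness and the logarithm
reduces it further, which is the kind of comparison one would make by
trial and error when choosing a transformation.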


5.  JOINT  ANALYSIS  OF  VARIABLES  AND  UNITS  -  MODELLING 


5.1.  Importance  of  joint  display  in  the  biplot 

The  biplot  jointly  displays  the  configuration  of  the 
variables  (columns  of  the  data  matrix)  and  the  standardized 
scatter  of  the  units  (rows  of  the  data  matrix) .  In  doing 
both  these  things  simultaneously,  it  differs  from  many 
other  displays,  which  concentrate  on  one  feature  to  the 
exclusion of the other. Multidimensional scaling models
either the correlation matrix of the variables or a distance
matrix for the units, but not both. It is not usually
feasible to bring the variables into the multidimensional
scaling of units, or vice versa. (See, however, Gabriel,
1978.) As a result, the analysis and interpretation provided
by such scaling is more limited than that provided by biplot
representation. Multidimensional scaling may have more
flexible fitting algorithms and is not restricted by the
geometry of least squares, but it is more limited in
what it displays.

Some  of  the  uses  of  the  biplot  in  interpreting  units' 
clusters  in  terms  of  variables  have  been  discussed  above. 
Analogously,  correlations  can  sometimes  be  explained  in 
terms  of  the  scatter  of  units  —  in  particular,  sometimes 
a  single  outlier  in  a  particular  direction  can  account  for 
an  increased  correlation  of  the  variables  displayed  in  its 
direction.  This  is  illustrated  by  the  two  parts  of  Figure  11. 
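
The effect of a single outlier on a correlation is easy to reproduce
numerically. The sketch below uses hypothetical data (not those of
Figure 11): five units with zero correlation between two variables,
to which one unit lying far out in the direction of both variables is
added.

```python
import math


def correlation(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical batch in which the two variables are uncorrelated.
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 1, 2]
r_without = correlation(x, y)

# The same batch with one outlying unit that is high on both variables.
r_with = correlation(x + [15], y + [15])
print(round(r_without, 3), round(r_with, 3))
```

On the biplot such a unit would appear as an isolated g-point in the
direction of the two h-arrows, which is what allows the inflated
correlation to be traced back to it.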


5.2.  Diagnosing  models  by  means  of  the  biplot 


Approximate functional fits of the data matrix may at
times be identified by inspection of the biplot. Thus,
if Z is approximately additive, i.e., if

    $z_{i,v} = \bar{z} + a_i + b_v + e_{i,v}$                 (5.1)

for

    $\bar{z} = \sum_i \sum_v z_{i,v} / nm$                    (5.2)

and some $a_1,\ldots,a_n$, $b_1,\ldots,b_m$ and small e's, then the
biplot of Z, or of $((z_{i,v} - \bar{z}))$, will have the
following simple form: The g-markers will be close to
one straight line, the h-markers close to another such
line and these two lines will be at 90° to each other.
Conversely, when the biplot markers display such a pattern,
additivity can be inferred.

What is more, if some row markers are on one line and
some column markers are on another line which is at 90° to
the first, then one can infer that additivity holds for
the sub-matrix of the corresponding rows and columns. For
an illustration, consider the artificial air pollution data
of Table 13 which is biplotted in Figure 12. It is immediately
evident that the heads of the h-arrows for the four
years are very close to collinear and that the g-points for six
of the stations are close to another line, pretty much at 90°
to the h-arrowhead line; only the g-point for station F is
far away from this line. One may therefore safely diagnose
an additive model for the 6x4 table obtained by omitting
station F. Inspection of the Table will show that this is indeed
appropriate.

TABLE 13

An Air Pollution Index at Seven Localities 1960-75 (Artificial Data)

Station    1960    1965    1970    1975
   A        100     102     105     110
   B         98      99     104     108
   C        107     110     112     116
   D         98     100     103     106
   E         86      90      91      95
   F        103     100      94      89
   G        111     111     115     119
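
The inspection of Table 13 can also be done numerically. The sketch
below fits the additive model (5.1) by row and column means (the usual
least squares fit for a complete two-way table, used here as an
assumption about how one would check the diagnosis) and compares the
largest residual with and without station F. The function names are
hypothetical.

```python
# Table 13: air pollution index by station for 1960, 1965, 1970, 1975.
table13 = {
    "A": [100, 102, 105, 110],
    "B": [98, 99, 104, 108],
    "C": [107, 110, 112, 116],
    "D": [98, 100, 103, 106],
    "E": [86, 90, 91, 95],
    "F": [103, 100, 94, 89],
    "G": [111, 111, 115, 119],
}


def additive_residuals(rows):
    """Residuals z - row mean - column mean + grand mean of a two-way table."""
    names = list(rows)
    n, m = len(names), len(rows[names[0]])
    grand = sum(sum(rows[s]) for s in names) / (n * m)
    row_mean = {s: sum(rows[s]) / m for s in names}
    col_mean = [sum(rows[s][v] for s in names) / n for v in range(m)]
    return {
        s: [rows[s][v] - row_mean[s] - col_mean[v] + grand for v in range(m)]
        for s in names
    }


def largest_abs_residual(rows):
    residuals = additive_residuals(rows)
    return max(abs(e) for row in residuals.values() for e in row)


all_seven = largest_abs_residual(table13)
without_f = largest_abs_residual(
    {s: row for s, row in table13.items() if s != "F"}
)
print(round(all_seven, 2), round(without_f, 2))
```

With station F omitted the residuals stay within a couple of index
points, whereas with all seven stations the largest residual is
several times bigger and belongs to station F.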

This diagnostic method extends to some other models as
well (Bradu and Gabriel, 1977). In particular, if the
two above lines intersect at an angle other than 90°, then a
Tukey degree-of-freedom-for-non-additivity model holds,
i.e.,

    $z_{i,v} = \bar{z} + a_i + b_v + \lambda a_i b_v + e_{i,v}$   (5.3)

for some $\lambda$.

This diagnostic use of the biplot may be quite important
since statisticians do not in general have adequate
tools for such diagnosis. Statistics textbooks generally
give methods of estimating parameters and testing fit of
given models, but do not usually provide techniques of
choosing a model.

Biplot diagnosis of models rests on the matrix decomposition

    $Y \approx GH'$.                                          (5.4)

The rows of the latter two matrices are displayed in the
biplot where visual inspection may lead to diagnoses of
simple geometric descriptions. When these descriptions
are formulated algebraically they can be entered into (5.4)
and may be translated into a model for the data matrix
itself.


5.3.  An  example  of  modelling  by  means  of  the  biplot 

As an example of the biplot's usefulness in modelling,
consider the case where the vertices of the h-arrows are
close to an ellipse. Writing $\mu$ for the center of the ellipse,
and $\alpha$ and $\beta$ for vectors along its principal axes, this means
that there exists $\theta_v$ for each v, such that

    $h_v = \mu + \alpha \cos\theta_v + \beta \sin\theta_v$.   (5.4)

Matrix H therefore becomes

    $H' = (\mu, \alpha, \beta)
          \begin{pmatrix}
            1 & \cdots & 1 & \cdots & 1 \\
            \cos\theta_1 & \cdots & \cos\theta_v & \cdots & \cos\theta_m \\
            \sin\theta_1 & \cdots & \sin\theta_v & \cdots & \sin\theta_m
          \end{pmatrix}$                                      (5.5)

and the data matrix is approximated by

    $Y \approx G(\mu, \alpha, \beta)
          \begin{pmatrix}
            1 & \cdots & 1 \\
            \cos\theta_1 & \cdots & \cos\theta_m \\
            \sin\theta_1 & \cdots & \sin\theta_m
          \end{pmatrix}$.                                     (5.6)

Thus, the i-th row is approximated by

    $y_i' \approx (g_i'\mu, g_i'\alpha, g_i'\beta)
          \begin{pmatrix}
            1 & \cdots & 1 \\
            \cos\theta_1 & \cdots & \cos\theta_m \\
            \sin\theta_1 & \cdots & \sin\theta_m
          \end{pmatrix}$.                                     (5.7)

Writing

    $\eta_i = g_i'\mu$,                                       (5.8a)

    $\gamma_i = g_i'\beta$,                                   (5.8b)

    $\delta_i = g_i'\alpha$,                                  (5.8c)

these approximations become

    $y_{i,v} \approx \eta_i + \gamma_i \sin\theta_v + \delta_i \cos\theta_v$,   (5.9)

or, defining

    $\rho_i = \sqrt{\gamma_i^2 + \delta_i^2}$                 (5.10)

and

    $\phi_i = \arctan(\gamma_i/\delta_i)$,                    (5.11)

they become

    $y_{i,v} \approx \eta_i + \rho_i \cos(\theta_v - \phi_i)$.   (5.12)

Thus, observation of the elliptical form of the
h-configuration has led to diagnosing a harmonic model
for the data matrix with constant and amplitude depending
on the rows and phase on the columns. An example where
such a model was appropriate was given in Section 2.6
where the elliptical configuration of the months' arrows
was found. Those annual temperature data could therefore
be fitted by a harmonic model with constants and amplitudes
depending on the station and the phase depending on the
month (Gabriel and Tsianco, 1980).
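
The step from (5.9) to (5.12) is the usual rewriting of a sine-cosine
pair as a single shifted cosine, and it can be checked numerically.
The sketch below uses hypothetical values for one row's constant and
coefficients, and twelve equally spaced angles standing in for months.
It uses atan2 rather than a plain arc tangent of the ratio so that the
phase falls in the correct quadrant; that choice is an assumption of
the sketch, not something stated in the text.

```python
import math


def amplitude_and_phase(gamma, delta):
    """Amplitude and phase such that
    gamma*sin(t) + delta*cos(t) == rho*cos(t - phi)."""
    rho = math.hypot(gamma, delta)
    phi = math.atan2(gamma, delta)
    return rho, phi


# Hypothetical parameters for a single row (station) of the data matrix.
eta, gamma, delta = 12.0, -3.0, 4.0
rho, phi = amplitude_and_phase(gamma, delta)

# Twelve equally spaced angles, one per month.
thetas = [2.0 * math.pi * v / 12 for v in range(12)]
sine_cosine_form = [
    eta + gamma * math.sin(t) + delta * math.cos(t) for t in thetas
]
shifted_cosine_form = [eta + rho * math.cos(t - phi) for t in thetas]
max_gap = max(
    abs(a - b) for a, b in zip(sine_cosine_form, shifted_cosine_form)
)
print(rho, round(phi, 4), max_gap)
```

The two forms agree to rounding error, so a row can be described
either by its two coefficients or by its amplitude and phase.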

Let it be stressed that the function of the biplot,
or bimodel, is merely to suggest a suitable model, not
to provide estimates of its parameters. Once a model
such as (5.1), (5.3) or (5.12) is suggested, standard
estimation techniques should be reverted to, such as
least squares or its robust counterparts. (We will not
discuss the fitting of the harmonic model here; see,
however, Gabriel and Tsianco, 1980.)

6. COMPARING SEVERAL BATCHES OF OBSERVATIONS

6.1.  Joint  inspection  of  the  two  batches'  scatters 

Observations coming from several different sources,
or populations, need different methodology and analysis
than single batches of multivariate data. For each single
batch of data, one may be concerned with description and
analysis of the configuration of variables and of the scatter
of units and with consideration of distributions, outliers,
models, and other summarizations. This, of course, may
also be of interest when several batches of data are
available, but the new aspect that appears at this stage
is that of comparing batches. Problems now arise with the
search for, and identification of, characteristics on which
the batches differ, with the measurement of "distances"
between batches, with the appraisal of the significance
or possible randomness of observed differences and with the
classification of additional units as being similar to one
or another of the batches according to these units'
multivariate observations.

One may begin with the most straightforward case of
two batches. In comparing two batches one generally ignores
the individuality of the units and regards them as mere
members of one batch or the other. Since each batch is
regarded as a sample from some population or distribution
the units lose their identities and become mere replicate
observations.

In comparing two samples, and in using them for testing
population differences, one considers the within sample,
inter-unit differences, mainly as providing estimates of
random variation, or "noise," against which to judge
inter-sample differences (averaged over units). Thus, in a
comparison of 1977 winter storms with 1978 winter storms,
the individual storms of each year are averaged for the main
comparison, and the variability from storm to storm within
each year serves as a yardstick against which one may measure
the averages' comparison. A study of the special features
of individual storms of either season would be part of each
batch's analysis, not part of the batch-to-batch comparison.

As  an  example,  consider  the  data  in  Table  14  relating  to 
26  storms  occurring  in  the  summer  of  1973.  Pielke  and 
Biondini  (1977)  treated  these  as  two  batches  of  storms,  13 
with  geostrophic  wind  speed  above  3m/sec  and  13  with  slower 
geostrophic  wind  speed.  In  comparing  these  two  batches, 
the  individual  storms  are  averaged  for  comparison,  and  the 
storm  to  storm  variation  within  each  batch  provides  estimates 
of  random  variability.  A  study  of  the  special  features  of 
each  batch's  individual  storms  is  not  a  main  part  of  the 
batch-to-batch  comparison.  Table  15  gives  the  five-variate 
means  of  each  batch  and  the  variance-covariance  estimates 
from  each  batch. 


                      Wind Speed
Date          R       (m/sec)       T       S       P
July  3     49.61       4          19     5.99    432
July  4    172.83       2          32     5.98    453
July  5     20.72       2          54     5.77    371
July  6     59.93       4          32    10.77    515
July  7     26.12       3          48    11.58    494
July 15     75.48       3          82     8.41    336
July 16     51.71       5          68    14.14    267
July 17     56.33       6          42    10.57    477
July 18     23.66       5          81    13.36    326
July 20     62.95       3          76    12.70    357
July 23     31.13       6          48     9.35    539
July 24     17.09      10          96    10.44    404
July 25     14.61       6          81     3.64    356
July 26     37.29       2.5        74    11.57    304
July 27     84.05       1          47    11.04    316
July 28     77.22       1          39     9.22    329
July 29    108.71       0          53    12.97    295
July 31     93.88       6          20    16.59    280
Aug   1     38.66       1          16    15.53    252
Aug   2     75.61       3          13    12.54    242
Aug   6     79.98       4          70    15.53    317
Aug  10    127.04       3          28     4.46    564
Aug  11     24.85       4          58     9.59    401
Aug  12     17.66       8          81    15.85    427
Aug  13     33.15       7          43     9.19    338
Aug  14     97.53       3          50     7.80    532

Direction (D), in the order printed (only 24 of the 26 entries
survive, so they are not aligned with the rows above):
90°, 70°, 225°, 270°, 135°, 100°, 170°, 100°, 100°, 135°, 120°, 90°,
85°, 110°, 180°, 170°, 180°, 180°, 180°, 190°, 100°, 140°, 135°, 90°

Fast: surface geostrophic wind speed > 3m/sec; slow: other
R: rainfall 20 log
D: surface level geostrophic wind direction
T: gradient of equivalent potential temperature
S: difference between saturation equivalent potential temperature
   and equivalent potential temperature
P: depth of convective instability

TABLE 15

Means and Variances-Covariances of Storm Data (Transformed)*

Fast Geostrophic Wind (Speed > 3 m/sec) - 13 storms

                 R          D          T          S          P
Means         215.10     125.38     248.03     111.55     294.82
St. Devs.      35.17      51.12      54.96      37.03      31.43

Variances - Covariances (Correlations below diagonal)

                 R          D          T          S          P
R            1237.06     910.08    -405.95     540.62    -194.38
D               .506    2613.31    -327.15     470.87     139.69
T              -.210      -.116    3020.83    1032.91    -999.24
S               .415       .249       .408    1370.91    -459.90
P              -.176       .087      -.579      -.395     987.61

Slow Geostrophic Wind (Speed ≤ 3 m/sec) - 13 storms

                 R          D          T          S          P
Means         251.09     150.00     244.35      99.67     287.04
St. Devs.      36.05      40.76      52.47      31.85      38.27

Variances - Covariances (Correlations below diagonal)

                 R          D          T          S          P
R            1299.37    -603.74    -719.94    -330.53     296.38
D              -.411    1661.54     499.50     321.12    -649.04
T              -.381       .234    2752.82    1183.67   -1472.29
S              -.288       .247       .708    1014.57    -855.09
P               .215      -.416      -.733      -.701    1646.74

♦Transformations:  R  *■ 

60 ZnR;  D  1 

D;  T  1000/T;  S  <- 

10S ;  P  •<- 

15/P. 

Note  that  the  means  and  variances  in  Table  15  do  not 
relate  directly  to  the  variables  in  the  form  computed  by  Pielke 
and  Biondini  —  Table  14  —  but  to  various  transforms  of 
these.  A  preliminary  check  by  means  of  probability  plots  showed 
three  of  the  variables  to  have  very  skew  distributions, 
especially  the  first  one.  Transformation  by  a 
fractional  power  was  therefore  indicated.  After  some  trial 
and  error,  transformations  were  chosen  which  produced  reasonably 
symmetric distributions. (The constants by which the variables
are multiplied were chosen so as to approximately equalize
the variances; this is important for biplotting: if the
biplot were fitted to non-standardized variables, the method
of least squares would produce a good fit for the variables
of large magnitude and all but ignore the variables of smaller
magnitude.)

Such  preliminary  inspection  and  transformation  of 
variables  is  quite  important.  Without  it  one  might  apply 
least  squares  methods  to  variables  which  are  highly  skewed  and 
for  which  these  methods  would  be  quite  unsuitable. 

One way of representing two batches of multivariate
observations is by regarding them as distinct scatters
of units in the same space of variables. An approximating
display (a GH' biplot) may be constructed for the matrix
of both batches' multivariate observations and the g-points
of the two batches may be distinguished on the biplot by
some special marks or colors. The summer 1973 storms are
biplotted accordingly in Figure 13, again using the data
transformed as noted in Table 15.

Figure 13: Biplot of Storm Data (Table 14)

It is immediately evident from this biplot that the
scatters of the two types of storm differ. The most obvious
difference is that the g-points for "fast" storms are mostly higher
up on the biplot than the g-points for the "slow" storms. The
vertical direction is that of the two variables R (rainfall) and
D (wind direction). Evidently slow storms have more rainfall and a
higher angle of wind direction than fast storms. The
two superimposed batch scatters are examined for differences in
distributions. If the two scatters are completely disjoint, one
may be sure of a clear between-sample difference. If there is
some overlap, the distinction is less obvious and may need
testing for significance (more about that later). At this
stage, one does well to inspect the shape of the two scatters
as well as their approximate location. If the centroids differ,
this indicates a difference in mean level of some variables;
the particular variables can be identified by considering
the vector from one of the centroids to the other and projecting
it onto the h-arrows to look for long intercepts. If the
extent or shape of the two scatters differs, this indicates
different variability; the particular variables on
which the variability differs are indicated by identifying the
h-arrows in the direction of differing scatter.

An  aid  in  inspecting  and  comparing  the  scatter  of 
samples  of  points  is  the  construction  of  concentration  ellipses. 

For a batch of m units whose biplot g-point coordinates are
$(g_{i,1}, g_{i,2})$, $i = 1, \ldots, m$, the concentration ellipse is defined as the
locus of

$$\bar{g} + \delta_1 \cos\theta + \delta_2 \sin\theta, \qquad 0 \le \theta \le 2\pi \tag{6.1}$$

with center at the point

$$\bar{g}' = \frac{1}{m} \sum_{i=1}^{m} (g_{i,1}, g_{i,2}) \tag{6.2}$$

and $\delta_1$ and $\delta_2$ being obtained as follows: Calculate the matrix

$$V = \frac{1}{m}
\begin{pmatrix}
\sum_i g_{i,1}^2 & \sum_i g_{i,1} g_{i,2} \\
\sum_i g_{i,1} g_{i,2} & \sum_i g_{i,2}^2
\end{pmatrix}
-
\begin{pmatrix}
\bar{g}_1^2 & \bar{g}_1 \bar{g}_2 \\
\bar{g}_1 \bar{g}_2 & \bar{g}_2^2
\end{pmatrix} \tag{6.3}$$

solve

$$V q_\nu = \lambda_\nu^2 q_\nu \qquad (\nu = 1, 2) \tag{6.4}$$

for the maximum and minimum eigenvalues $\lambda_1^2$ and $\lambda_2^2$, respectively,
and set

$$\delta_\nu = \lambda_\nu q_\nu \qquad (\nu = 1, 2). \tag{6.5}$$

The center of the ellipse -- and the centroid of the
batch of m points -- is at $\bar{g}'$ and the maximum and minimum
diameters are, respectively, of lengths $2\lambda_1$ and $2\lambda_2$ and in directions
$q_1$ and $q_2$ from the centroid.
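
The construction in (6.1) to (6.5) can be sketched numerically as follows. This is only an illustrative sketch, not the author's program: the function name `concentration_ellipse` and the sample points are hypothetical, the divisor in (6.3) is assumed to be the batch size m, and the eigenvectors are taken to have unit length.

```python
import numpy as np

def concentration_ellipse(g, n_points=200):
    """Centroid, semi-axis vectors and boundary points of a concentration ellipse.

    g is an (m, 2) array of biplot g-point coordinates for one batch.
    """
    g = np.asarray(g, dtype=float)
    m = g.shape[0]
    g_bar = g.mean(axis=0)                          # centroid, as in (6.2)
    # Second-moment matrix about the centroid, as in (6.3); divisor m is assumed.
    V = g.T @ g / m - np.outer(g_bar, g_bar)
    eigvals, eigvecs = np.linalg.eigh(V)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
    lam = np.sqrt(np.clip(eigvals[order], 0.0, None))
    Q = eigvecs[:, order]                           # unit-length eigenvectors as columns
    delta = Q * lam                                 # columns are delta_1, delta_2, as in (6.5)
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    # Locus (6.1): centroid + delta_1 cos(theta) + delta_2 sin(theta)
    boundary = (g_bar
                + np.outer(np.cos(theta), delta[:, 0])
                + np.outer(np.sin(theta), delta[:, 1]))
    return g_bar, delta, boundary

# Hypothetical g-points for a small batch
g_points = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
centroid, axes, outline = concentration_ellipse(g_points)
```

The returned `outline` can be passed directly to any plotting routine to draw the ellipse on top of the biplot.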

The concentration ellipses for storms of each type
are drawn onto the biplot in Figure 14. They clearly show the
vertical displacement of the two samples, confirming the impression
gained from inspection of the g-points themselves. They also
indicate no horizontal displacement, confirming that the
two types of storms do not differ appreciably on variables
T, S and P. (The correctness of these graphical impressions
can be verified from the means in Table 15.)

Figure 14: Biplot of Storms With Identification of Fast (1)
and Slow (2) Storms and Concentration Ellipses

In addition to this comparison of centroids, one may use
the shape of the concentration ellipses to compare the variance
and correlation configurations of the two batches. In Figure 14
the ellipse for the slow storms is considerably squatter
and slightly wider than that of the fast storms. Recalling the
relation of scatters to correlations, one may infer that the
T, S and P correlations would not differ much between the
batches, but that the correlations of R and D with each other and
with T, S and P might differ. The biplot suggests that the
R, D correlation is higher for fast storms than for slow ones --
and that indeed is the most striking difference between the
two correlation matrices in Table 15. It also suggests that
both R and D's correlations with T, P and S are smaller in
magnitude for fast storms than for slow storms, though the
signs remain the same. This does not clearly reflect the
actual correlations in Table 15. Evidently, comparison of ellipses
can be used to indicate the existence of differences in variances
and correlations, but it is difficult to use it to infer what the
actual differences in configuration are.

A more sensitive display of differences in the configurations
of two batches may be obtained by biplotting each batch separately
and superimposing their h-configurations (the g-scatters
are of no interest in this context). These h-plots (Corsten
and Gabriel, 1976) allow more detailed comparisons. Thus,
for the two batches of 1973 storms, the two h-configurations are
superimposed in Figure 15. Note that with a slight rotation
four of the five pairs of h-vectors can be made to overlap
pretty well. The obvious exception is the h_R vector, which is
in almost opposite directions in the two configurations -- its
correlations with the other variables must be of virtually
opposite signs in the two batches. This agrees pretty well
with Table 15.

We recapitulate the methods of comparing batches of
multivariable data. To begin with, one should check the
batch scatters to see whether they are reasonably elliptic in
character. If there seem to be a few outliers, long tails
and/or strong concentration at one edge or corner (likely
to be the zero point of the variables if the measurements are
all non-negative) then the data should be readjusted in a
manner similar to single batch data with such properties. If
the variability seems to be systematically larger for the
batch with larger means, transformations may be called for. A
comparison of log (standard deviation) against log (mean) may
show a fixed slope for some of the variables -- these variables'
observations are likely to be more regularly scattered, i.e.,
have more equal variabilities, if they are re-expressed as

$$(\text{variable})^{1-\text{slope}} = (\text{re-expressed variable}).$$

Such re-expression is a rough and ready method and the
exponent should in general be rounded to the nearest 1/2.
(Note that for a slope around 1, $(\cdot)^{1-\text{slope}}$ is to be read as
$\log(\cdot)$.) (Tukey, 1977, Chapters 3 and 4).
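
The re-expression rule just described can be sketched in a few lines. This is a rough illustration only: the function names and the batch means and standard deviations below are hypothetical, and the slope is estimated here by an ordinary least-squares line through the (log mean, log standard deviation) points, which is one simple choice among several.

```python
import numpy as np

def spread_level_exponent(means, sds):
    """Slope of log(sd) on log(mean) and the power 1 - slope, rounded to the nearest 1/2."""
    slope = float(np.polyfit(np.log(means), np.log(sds), 1)[0])
    exponent = round(2.0 * (1.0 - slope)) / 2.0
    return slope, exponent

def re_express(values, exponent):
    """Apply the power re-expression; an exponent of 0 is read as the logarithm."""
    values = np.asarray(values, dtype=float)
    if exponent == 0:
        return np.log(values)
    return values ** exponent

# Hypothetical batch means and standard deviations of one variable
batch_means = [4.0, 16.0, 64.0]
batch_sds = [2.0, 4.0, 8.0]
slope, exponent = spread_level_exponent(batch_means, batch_sds)
# Here the slope is about 1/2, so the suggested re-expression is the square root.
```

With a slope near 1 the same function returns an exponent of 0, and `re_express` then takes logarithms, in line with the note above.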
6.3. Comparison of three or more batches

When multivariate observations occur in several batches,
or have been a priori classified into several categories,
one may wish to compare these several sets of observations.
Essentially, such comparisons of three or more batches are
analogous to the comparison of two batches as described
above. There are essentially two approaches to such comparisons:
(1) comparing the several batches' configurations
of variables without reference to the location of the scatters,
or, (2) comparing the several batches' unit-scatters against
the background of one configuration of variables. These two
approaches correspond to univariate comparisons of scale
and location.

For comparisons (1) of configurations one would need
separate variance-covariance configurations to be obtained and
displayed for each batch. Thus, the h-configuration of each
batch would be obtained from its GH'-biplot. These configurations
might then be displayed alongside one another and compared
visually. If the number of variables is not very small, such
visual comparisons may be quite difficult. Section 6.2 illustrated
a comparison of two batches' 4-variate configurations. Consider how
much more difficult the comparison of six batches would be if 10-variate
configurations were displayed for each. Unfortunately
we cannot suggest a simpler way of making such comparisons.
They are complex and perhaps cannot be further simplified.

For location comparisons (2) one might begin by pooling
all batches to obtain an overall estimate of the variables'
configuration, i.e., the h-configuration in the GH'-biplot
of all the data. One would then compare the batches by
classifying the g-points according to the batches whose
units they represent. And again, as in Section 6.1, one
might summarize the scatter of each batch by a concentration
ellipse. The comparison of batch scatters is then conveniently
done by inspecting the locations and shapes of the different
ellipses as in Section 6.1.

An alternative approach to location comparisons is to compare
only the multivariate means of the several batches. A suitable
metric for such comparisons is that of the "within batches" sums
of squares and products (its use assumes that the variance-covariance
configurations of the different batches are much the same).
Thus, a biplot of the means of the several batches would show
which batches differ from which other batches and on what variables
these differences are evident.

Such  an  approach  is  analogous  to  MANOVA  (multivariate 
analysis  of  variance) .  An  application  to  meteorology  has  been 
studied  in  the  context  of  the  Israeli  rainfall  stimulation 
experiment  (Gabriel,  1972) . 


6.4.  Classification  of  new  data  into  given  categories 


Discriminant  analysis 

A  common  situation  requires  the  classification  of  a 
new  unit  into  one  of  several  populations  from  which  it 
might  have  originated.  Thus,  storms  may  be  of  a  number 
of  synoptic  types  and  radar  observations  may  be  available 
for  batches  of  earlier  storms  of  each  type.  A  new  storm 
now  occurs  and  one  is  asked  to  use  its  radar  observations 
in  order  to  allocate  it  to  one  synoptic  type.  Statistically, 
one  would  want  to  classify  the  new  storm  into  the  type 
whose  batch's  radar  observations  match  the  new  storm's 
observations  most  closely.  That  essentially  is  the  problem 
statisticians  refer  to  as  "discrimination"  and  the  techniques 
they  use  go  under  the  name  of  discriminant  analysis.  The 
subject is too large to explore here: Instead we refer
the reader to Miller's (1964) monograph, written for meteorologists,
to Lachenbruch's (1975) volume on discriminant
analysis and to Gabriel and Pun's (1978) description of
and program for two category discrimination by logistic
techniques.

6.5. An Example - Different Techniques for Comparing Batches

To  evaluate  the  three  different  ways  of  comparing  several 
batches  of  multivariate  data,  we  will  study  an  example  in  some 
detail.  We  use  historical  data  of  annual  precipitation  in 
Illinois  to  simulate  a  weather  modification  situation.  Suppose 
a  "cloud  seeding"  operation  had  taken  place  in  the  years  1955  - 
1960  in  the  Southern  Illinois  area,  and  that  another  such  operation 
had  been  carried  out  during  1970-78  in  the  Northeastern  part  of 
Illinois.  Also,  suppose  that  no  cloud  seeding  was  carried  out  in 
Illinois  at  any  other  time  or  place.  Central  Illinois  precipitation 
could,  therefore,  serve  as  concomitant  observations  to  indicate 
"natural"  precipitation;  it  would  not  have  been  "seeded"  in 
either period. (The quotes are used since the data relate to
simulated "operations", not to real ones.)

To evaluate the effect of both "operations" it may be
proposed to use 50 years' data, 1929-78, for the following five
stations: Dubuque and Moline to represent Northeastern Illinois,
"seeded" in Period IV -- 1970-78; St. Louis to represent
Southern Illinois, "seeded" in Period II -- 1955-60; Peoria and
Springfield to represent Central Illinois -- never "seeded".

These 50 years also provide two "unseeded" periods for
comparison, i.e., I - 1929-1954 and III - 1961-1969, as set out
in Table 16. The corresponding data for annual precipitation are
shown in Table 17. Note that these are actual precipitation
data except that in the "operational" years each "target" station's
precipitation was augmented to simulate effects of "seeding".

We are using simulated data for illustration because that
allows us to anticipate the findings and then see how, and to what

Table 16

Areas and Periods of "Operations" and Comparisons

Period          No. of   Southern Illinois   Northeastern Illinois   Central Illinois
                Years    "Target"            "Target"                Control
                         (St. Louis)         (Dubuque, Moline)       (Peoria, Springfield)

I.   1929-54    26       Unseeded            Unseeded                Unseeded
II.  1955-60     6       "Seeded"            Unseeded                Unseeded
III. 1961-69     9       Unseeded            Unseeded                Unseeded
IV.  1970-78     9       Unseeded            "Seeded"                Unseeded

Total           50

Table 17 Annual Precipitation at 5 Illinois Stations 1929-78

Year    DUB      MOL      PEO      SPR      STL

1929    24.1??   ??.710   ??.0??   ??.???   4?.??5
1930    ??.?5?   30.010   24.030   24.320   2?.730
1931    29.5?0   ?1.390   ?7.75?   36.210   37.300
1932    25.370   34.490   ??.660   32.050   38.01?
1933    28.700   20.310   ?4.?7?   36.470   24.770
1934    34.500   36.850   30.450   25.680   29.??0
1935    32.550   35.5?0   40.150   41.220   39.360
1936    26.77?   ?0.080   ?0.910   26.920   26.140
1937    31.770   30.960   29.890   34.630   35.870
1938    47.630   43.750   42.620   36.980   41.220
1939    29.890   28.500   38.270   33.050   40.150
1940    ?1.500   25.200   24.160   22.680   25.000
1941    32.500   36.940   42.290   44.720   32.120
1942    35.570   32.380   37.860   42.360   45.140
1943    31.520   32.160   32.810   32.360   33.600
1944    42.500   38.930   35.930   33.280   ??.510
1945    3?.760   ??.840   36.130   43.400   49.320
1946    32.510   38.320   38.890   39.910   57.120
1947    42.280   35.630   39.170   36.480   35.780
1948    23.350   34.350   30.130   30.860   42.260
1949    31.510   34.5?0   33.330   37.520   45.760
1950    32.330   32.880   37.300   32.050   37.230
1951    45.010   48.600   37.230   38.510   36.???
1952    27.260   28.640   35.430   ?0.390   25.670
1953    34.550   26.470   28.630   23.980   20.590
1954    38.210   38.860   41.960   26.670   27.?10
1955    26.070   26.090   25.990   34.150   40.729
1956    24.080   20.200   25.620   31.210   44.759
1957    38.320   32.920   30.990   41.970   61.?0?
1958    2?.070   24.450   31.450   30.560   48.594
1959    54.560   42.100   30.630   25.980   36.?03
1960    43.360   39.450   37.630   38.910   41.714
1961    63.390   45.900   39.450   ??.910   41.200
1962    42.770   33.850   24.820   30.620   24.610
1963    35.440   30.780   25.660   26.890   28.620
1964    36.140   35.070   28.950   31.020   32.160
1965    61.420   49.590   48.260   29.??0   28.260
1966    39.230   ?7.6?0   33.140   30.700   32.340
1967    52.970   42.360   35.95?   36.210   41.300
1968    39.???   31.350   ??.890   21.670   ?2.490
1969    33.700   41.790   ?3.700   34.???   43.720
1970    47.801   67.236   44.720   33.250   36.2?0
1971    48.217   49.972   26.380   27.620   ??.730
1972    51.714   60.645   36.230   32.020   33.740
1973    51.426   73.26?   50.22?   44.29?   39.?20
1974    50.154   60.679   42.510   40.820   36.830
1975    42.263   ?1.635   41.220   ?7.660   40.210
1976    30.654   32.461   3?.230   25.700   23.460
1977    50.739   54.548   38.410   42.710   43.410
1978    40.300   40.651   32.090   31.?30   37.710

NOTE: These are actual precipitation data as obtained from the
Illinois State Water Survey, except for the 1955-60 figures for
St. Louis and the 1970-78 figures for Dubuque and Moline which
are equal to 130% of the recorded natural precipitation.

extent, the analyses recover the simulated "effects". Thus, we
should  expect  that  during  "operations"  the  precipitation  at  target 
stations  would  be  higher  and  more  variable. 
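
The reason for this expectation can be written out in one line. As a small worked step (the symbols $\bar{x}$ and $s$ are introduced here only for illustration), let $\bar{x}$ and $s$ be the natural mean and standard deviation of a target station over its "operational" years. Since every figure in those years is multiplied by 1.3, as stated in the note to Table 17,

$$\bar{x}_{\text{seeded}} = 1.3\,\bar{x}, \qquad s_{\text{seeded}} = 1.3\,s ,$$

so that both the level and the spread of the "seeded" figures are 30% above their natural values.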

We begin by examining the entire data set, irrespective of
batches, i.e., operational or other years. Means, standard
deviations, covariances and correlations are shown in Table 18,
and the co-ordinates for the GH'-biplot are given in Table 19,
the biplot having been fitted to residuals from the 5-variate centroid.
This biplot is displayed in Figure 16.

Mean precipitation -- Table 18 -- is pretty uniform over the
five Illinois stations -- perhaps a little lower in the center.
Variability changes more strikingly, the standard deviations being
appreciably smaller in Central Illinois. Correlations reflect the
geographical location, the highest correlations being found for
adjacent stations, i.e., Dubuque with Moline, Moline with Dubuque
and to a lesser extent with Peoria, Peoria with Springfield and
St. Louis with Springfield. Generally, correlation tapers off
with distance between stations -- thus the St. Louis correlations
with Dubuque and Moline are very low.
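
The correlations quoted from Table 18 follow from its covariances and standard deviations by dividing each covariance by the product of the two standard deviations. The short check below is an illustrative sketch: the numbers are the standard deviations and covariances as read from Table 18, while the variable names are introduced here for the example.

```python
import numpy as np

stations = ["DUB", "MOL", "PEO", "SPR", "STL"]
sd = np.array([9.626, 10.805, 6.013, 5.474, 8.203])

# Covariances as read from Table 18 (upper triangle, row by row).
upper = [77.643, 24.937, 18.916, 1.022,
         39.215, 26.563, 5.620,
         22.878, 13.405,
         27.287]
cov = np.zeros((5, 5))
cov[np.triu_indices(5, k=1)] = upper
cov = cov + cov.T + np.diag(sd ** 2)      # full symmetric covariance matrix

# Correlation = covariance / (sd_i * sd_j)
corr = cov / np.outer(sd, sd)

for i in range(5):
    for j in range(i + 1, 5):
        print(f"{stations[i]}-{stations[j]}: {corr[i, j]:.4f}")
```

The Dubuque-Moline value comes out near .7465 and the Dubuque-St. Louis value near .0129, which is the tapering with distance described above.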

The biplot -- Figure 16 -- reflects this configuration of
variation and covariation (since this GH'-biplot is mean-centered
it conveys no information on means). The h-arrows for the Central
Illinois stations are shorter (less variability) than those for
the stations in North and South Illinois. The order of the
arrowheads reflects the geographical location of the five stations
and so the angles subtended at the centroid are smaller for nearby
stations and larger for far-away stations: Thus, the cosines
decrease with distance, reflecting the decrease of correlations

Table 18 Measures of Location and Dispersion of the Entire Data Set

Station               Dubuque   Moline    Peoria    Springfield   St. Louis

Means                 38.111    37.897    35.043    34.503        37.507
Standard Deviation     9.626    10.805     6.013     5.474         8.203

Covariances (above diagonal) and Correlations (below diagonal)

               Dubuque   Moline    Peoria    Springfield   St. Louis

Dubuque          --      77.643    24.937    18.916         1.022
Moline          .7465      --      39.215    26.563         5.620
Peoria          .4308     .6036      --      22.878        13.405
Springfield     .3590     .4491     .6951      --          27.287
St. Louis       .0129     .0634     .2717     .6077          --

Table 19 GH'-biplot Co-ordinates for Entire Set

  i      g_i1     g_i2   |    i      g_i1     g_i2   |    i      g_i1     g_i2

1929    -.083    -.193   |  1946     .014    -.338   |  1963    -.114     .154
1930    -.182     .232   |  1947     .023     .003   |  1964    -.058     .095
1931    -.085    -.068   |  1948    -.070    -.058   |  1965     .259     .161
1932    -.104    -.046   |  1949    -.051    -.163   |  1966    -.014     .097
1933    -.124    -.034   |  1950    -.068    -.032   |  1967     .130    -.007
1934    -.048     .099   |  1951     .133     .020   |  1968    -.046     .077
1935    -.017    -.114   |  1952    -.149     .118   |  1969     .005    -.095
1936    -.158     .137   |  1953    -.158     .271   |  1970     .300     .057
1937    -.102    -.007   |  1954     .000     .156   |  1971     .101     .178
1938     .129    -.048   |  1955    -.154    -.094   |  1972     .235     .156
1939    -.106    -.094   |  1956    -.234    -.145   |  1973     .397    -.020
1940    -.185     .226   |  1957     .017    -.389   |  1974     .269     .035
1941    -.001    -.042   |  1958    -.173    -.195   |  1975     .051    -.067
1942    -.014    -.193   |  1959     .115     .078   |  1976    -.127     .211
1943    -.092     .032   |  1960     .064    -.070   |  1977     .227    -.064
1944     .029     .071   |  1961     .230     .020   |  1978     .019     .031
1945     .012    -.237   |  1962    -.042     .098   |

  j       h_j1       h_j2

DUB     59.235    +14.807
MOL     71.587    + 8.645
PEO     29.079    -12.736
SPR     22.569    -24.452
ST.L    10.577    -54.246

λ1 = 100.50     λ2 = 63.22     Σλ² = 16798.25     Goodness of Fit 0.8392

Figure 16: GH'-biplot of 50 Years' Illinois Rainfall (g-points
Identified by Years; h-arrows by Station)

with distance. This biplot is therefore seen to provide a simple
display of both the pattern of variation and the configuration of
the correlations.
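
For readers who wish to reproduce co-ordinates of the kind listed in Table 19, a GH'-biplot can be sketched from the singular value decomposition of the mean-centered data matrix. The sketch below is an assumption-laden illustration rather than the author's program: the function name `gh_biplot` is hypothetical, and the scaling used (g-points taken from the left singular vectors, h-arrows from the right singular vectors multiplied by the singular values) is the one that appears consistent with the magnitudes in Table 19.

```python
import numpy as np

def gh_biplot(Z, rank=2):
    """Row markers G, column markers H, singular values and goodness of fit.

    Z is an (n, m) data matrix; columns are centered on their means first.
    """
    Z = np.asarray(Z, dtype=float)
    Y = Z - Z.mean(axis=0)                       # residuals from the centroid
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    G = U[:, :rank]                              # g-points (orthonormal columns)
    H = Vt[:rank].T * s[:rank]                   # h-arrows carry the scale
    fit = (s[:rank] ** 2).sum() / (s ** 2).sum() # share of squared variation displayed
    return G, H, s, fit

# Goodness of fit recomputed from the values printed with Table 19
fit_table_19 = (100.50 ** 2 + 63.22 ** 2) / 16798.25
print(round(fit_table_19, 4))
```

With this scaling the inner products of the h-arrows approximate (n - 1) times the variances and covariances, so arrow lengths reflect standard deviations and cosines of angles reflect correlations, which is how the biplot is read in the text.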

We next turn to the scatter of g-points in Figure 16,
which displays the distribution of the 50 years about their
5-variate mean. This is a pretty evenly spread scatter -- no
obvious outliers are evident, except perhaps 1957 at the bottom
of the biplot. This shows unusually high precipitation in 1957
at St. Louis. (Note that this was a year in which St. Louis
was a "seeding target"!) Indeed, on closer examination, we
note 4 out of the 6 years "seeded" at St. Louis to have g-points
pretty far out in the direction of the h-arrow for that station.
Also, we see that 6 out of the 9 years of Northeast Illinois
"seeding" have g-points far out in the biplot direction of the
Dubuque and Moline h-arrows. This is suggestive of "seeding
effects".

The  distinction  between  the  four  periods  may  be  accentuated 
by  suppressing  the  dates  on  the  biplot  and  substituting  the  number 
of  the  period,  i.e.,  1,  2,  3  or  4,  at  each  g-point.  This  is  done  in 
Figure  17.  This  display  emphasizes  the  predominance  of  g-points 
of  periods  II  and  IV  in  the  directions  of  the  h-arrows  for, 
respectively,  St.  Louis  and  Dubuque/Moline. 

Figure 18 is another version of this same GH'-biplot in
which the individual years' g-points have been replaced by
concentration ellipses for each of the periods. Now the
comparisons are much easier to grasp. The average level of Period II
precipitation is seen to be highest on the St. Louis target and
that of Period IV on the Dubuque and Moline targets. The two
unseeded periods, I and III, have fairly similar ellipses which
are not particularly high at any one of the stations.

Also note the different shapes of the ellipses, indicating
differences in variability. The elongation of the Period IV
ellipse along h_DUB and h_MOL suggests that the variance in Northeastern
Illinois and the correlation between the stations must
have been higher in the period when it was the "seeding target".
Similarly, the ellipse for Period II is somewhat elongated along
the direction of h_STL. That indicates that when St. Louis was
being "seeded" its variability was rather high.

Inspection of the GH' biplot of the entire data set has
revealed differences in location as well as in variability and
correlations. In most analyses this is likely to be the single
most useful display. However, we will also illustrate the other
two displays: The set of batch h-plots which is designed
specifically for comparisons of variability and correlation; and
the MANOVA biplot which displays comparisons of means standardized
for within batch variability.

For the comparison of periods, Table 20 gives the means,
standard deviations, covariances and correlations and Table 21
the co-ordinates for the h-plots of all periods. The four
periods' h-plots are shown together in Figure 19.

The h-plots for the four periods -- Figure 19 -- look
rather different at first glance, as do the standard deviations
and correlations in Table 20. This is mostly due to the random
variability between such batches of data: It is well known
that correlations based on samples of as few as 6 and 9 observations
fluctuate wildly. Indeed, the comparison of Periods I and III
which were both "unseeded" shows how large random variability

Table 20 Measures of Location and Dispersion of Four Periods of Years

Means

Period             DUB       MOL       PEO       SPR       ST.L

I.   1929-54      33.558    33.957    35.116    34.253    36.135
II.  1955-60      35.793    30.868    32.052    35.463    45.585
III. 1961-69      45.002    38.830    33.758    33.431    34.967
IV.  1970-78      45.919    53.033    38.112    35.657    36.123

Standard Deviations (diagonal), Covariances (above diagonal), Correlations (below diagonal)

Period I           DUB       MOL       PEO       SPR       ST.L

DUB                5.886    19.998     9.298     8.738     4.234
MOL                 .679     5.000    15.158    15.019    16.554
PEO                 .309      .593     5.115    19.073    21.562
SPR                 .250      .506      .628     5.934    35.309
ST.L                .083      .380      .484      .683     8.708

Period II          DUB       MOL       PEO       SPR       ST.L

DUB               11.854   100.791    25.574    32.601   -25.972
MOL                 .973     8.739    24.873    26.371   -18.066
PEO                 .474      .626     4.550    16.728    15.964
SPR                 .620      .680      .829     4.435    15.637
ST.L               -.253     -.238      .405      .407     8.671

Period III         DUB       MOL       PEO       SPR       ST.L

DUB               11.361    57.407    62.972    34.954     8.756
MOL                 .779     6.487    41.067    22.489    12.777
PEO                 .768      .877     7.215    23.575     2.621
SPR                 .842      .949      .895     3.652     8.416
ST.L                .135      .344      .063      .403     5.721

Period IV          DUB       MOL       PEO       SPR       ST.L

DUB                6.999    81.964    23.608    29.373    26.095
MOL                 .840    13.946    70.461    60.135    32.450
PEO                 .453      .678     7.451    43.173    21.676
SPR                 .632      .649      .872     6.645    30.404
ST.L                .656      .410      .512      .805     5.682

Table 21 h-plot Co-ordinates for the Four Periods

                   DUB      MOL      PEO      SPR      STL

Period I
  h_j1            10.9     17.0     18.5     25.4     39.2      λ1 = 54.1
  h_j2            24.9     15.9      5.6      0.5    -16.1      λ2 = 34.1
  Goodness of Fit = .8236

Period II
  h_j1            26.3     19.4      5.5      6.4     -5.3      λ1 = 34.2
  h_j2            -0.1     -0.9     -6.6      6.5    -18.4      λ2 = 20.7
  Goodness of Fit = .9596

Period III
  h_j1            30.9     16.7     18.3      9.8      3.5      λ1 = 40.9
  h_j2             3.8     -4.1      1.6      2.4    -15.5      λ2 = 16.7
  Goodness of Fit = .9072

Period IV
  h_j1            17.2     38.2     16.6     15.2      9.6      λ1 = 48.5
  h_j2             2.7      9.5     -9.4     10.6     -9.4      λ2 = 19.6
  Goodness of Fit = .9120

really is: Note the Dubuque-Springfield correlation being 0.250
in Period I versus 0.842 in Period III! This illustration should
serve as a warning against drawing far reaching conclusions about
variability and correlation on the basis of small data sets.
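
The size of such fluctuations can be gauged by a small simulation. The sketch below is purely illustrative and rests on assumptions not made in the text: the observations are drawn from a bivariate normal distribution, the true correlation of 0.5 is an arbitrary choice, and the function name is hypothetical.

```python
import numpy as np

def simulate_sample_correlations(rho, n, n_rep=1000, seed=0):
    """Sample correlations from n_rep batches of n bivariate normal observations."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    r = np.empty(n_rep)
    for k in range(n_rep):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        r[k] = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
    return r

# Batches as small as Periods II-IV (6 or 9 years) versus the full 50 years
for n in (6, 9, 26, 50):
    r = simulate_sample_correlations(rho=0.5, n=n)
    lo, hi = np.percentile(r, [5, 95])
    print(f"n = {n:2d}: central 90% of sample correlations from {lo:.2f} to {hi:.2f}")
```

The printed ranges narrow markedly as n grows, which is the point of the warning above: with 6 or 9 observations a sample correlation says little about the underlying one.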

Despite  the  smallness  of  the  samples,  there  is  some  consistency 
in  the  four  h-plots  of  Figure  19.  The  geographical  gradient  from 
Northeastern  through  Central  to  Southern  Illinois  is  shown  consistently 
in  all  periods  except  III  in  which  there  is  one  inversion  in  the 
geographical  order  —  between  Moline  and  Peoria.  The  orientation 
of  the  geographical  gradient  changes  from  period  to  period,  but 
the  gradient  persists,  illustrating  that  some  general  patterns 
may  be  revealed  even  from  small  samples  of  data. 

It is difficult to find the expected "effects of seeding" in
these displays. "Target" variability was expected to increase
during "seeding" -- the Moline h-arrow is unusually long in
Period IV. But the Dubuque h-arrow is rather short in Period IV
and the St. Louis h-arrow is not particularly long in Period II.
Nor does the angle between h_DUB and h_MOL seem unusually low in
Period IV -- as it should have been if "seeding" had increased
correlation between "target" stations. Indeed, if we check back
to Table 20 we note that these "expected effects" did not
occur. It is not the h-plot display that obscured them, but the
magnitude of random fluctuations in small samples.

Table 22

Estimate of "Error" Variance Based on Two Periods
Without Operations

Stations      DUB       MOL       PEO       SPR       STL

DUB           7.585    29.067    22.310    15.094     5.330
MOL            .692     5.541    21.439    16.830    15.638
PEO            .516      .679     5.696    20.165    16.970
SPR            .364      .555      .647     5.469    28.789
STL            .087      .349      .368      .651     8.086

NOTE: Covariances above diagonal, standard deviations in
diagonal, correlations below diagonal

Finally, we turn to the comparison of means -- shown above in
Table 20 -- as standardized by random variation and covariance.
That is the multivariate analysis of variance (MANOVA) approach.
Standardization is effected by weighting with the inverse of an
estimated variance-covariance. The usual estimate is the "within"
matrix of variances and covariances. (In this example it would be
estimated by pooling the bottom four panels of Table 20 with
weights 25, 5, 8 and 8, respectively, to a total of 46 degrees of
freedom for error), but in the present instance we prefer to
pool only the two "unseeded" periods so as to avoid possible
contamination of the estimate by "seeding" effect. Thus, we
pooled panels I and III of Table 20 with weights 25 and 8, respectively,
yielding 33 error degrees of freedom -- Table 22.
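
The pooling step can be checked on a small piece of the matrix. The sketch below is illustrative only: the function name is hypothetical, and it uses just the Dubuque and Peoria entries of panels I and III of Table 20 (standard deviations squared to give variances, with the corresponding covariances), weighted by their degrees of freedom.

```python
import numpy as np

def pooled_covariance(cov_matrices, dfs):
    """Degrees-of-freedom weighted average of batch covariance matrices."""
    dfs = np.asarray(dfs, dtype=float)
    total = sum(d * np.asarray(c, dtype=float) for d, c in zip(dfs, cov_matrices))
    return total / dfs.sum()

# Dubuque and Peoria entries of Table 20, Periods I and III
cov_period_1 = [[5.886 ** 2, 9.298],
                [9.298, 5.115 ** 2]]
cov_period_3 = [[11.361 ** 2, 62.972],
                [62.972, 7.215 ** 2]]

pooled = pooled_covariance([cov_period_1, cov_period_3], dfs=[25, 8])
pooled_sd = np.sqrt(np.diag(pooled))
print(pooled_sd, pooled[0, 1])
```

The pooled standard deviations come out close to 7.585 and 5.696 and the pooled covariance close to 22.310, in agreement with the Dubuque and Peoria entries of Table 22.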

The  MANOVA  calculations  are  shown  in  Table  23  and  the  corres¬ 
ponding  JK' -biplot  of  the  four  period  means  at  the  five  stations  is 
displayed  in  Figure  20.  Each  period  mean  is  surrounded  by  a 
"comparison  circle"  which  gives  an  idea  of  the  random  variability 
of  each  of  those  period  means.  (The  method  of  calculation  of  the 
radiuses  of  these  circles  is  shown  in  Table  23;  for  a  discussion  of  the 
rationale  of  these  methods  see  Gabriel,  1972) .  The  interpretation 
of  these  circles  is  simple.  Any  two  periods  whose  circles  intersect 
do  not  differ  more  than  expected  by  chance;  any  two  periods  whose 
circles  are  disjoint  differ  significantly,  i.e.,  more  than  expected 
by  chance.  In  this  application  chance  variability  is  read  as  95%  of 
random  variability  overall;  thus  a  5%  chance  —  level  of  significance  — 
is  allowed  of  finding  significance  on  some  pair  of  periods  that  does 
not really differ. (Other levels could be chosen, e.g., for a 1%
level the circles would be larger, because a larger $\theta$ would be
read from Heck's charts, and fewer differences found significant.
That would be a safer, but less revealing, strategy.)

Table 23. Calculations for MANOVA and JK'-biplot of Means

X: Batch Means

Period       DUB       MOL       PEO       SPR      ST.L
I         33.558    33.957    35.116    34.253    36.135
II        35.793    30.868    32.052    35.463    45.585
III       45.002    38.830    33.758    33.431    34.967
IV        45.919    53.033    38.112    35.657    36.123

$S^{-1}$ = Inverse of Error Variances

           DUB        MOL        PEO        SPR       ST.L
D      .036037   -.032422   -.004803   -.004409    .008005
M     -.032422    .092402   -.029702   -.007462   -.008463
P     -.004803   -.029702    .072456   -.032675    .003077
S     -.004409   -.007462   -.032675    .089663   -.028855
L      .008005   -.008463    .003077   -.028855    .028573

N = Diagonal Matrix of Sample Sizes

$i$        I     II    III     IV
$n_i$     26      6      9      9

$X'NXS^{-1}$

            DUB          MOL          PEO          SPR         ST.L
DUB     -1.47578    102.07099    -46.89378    -16.25012     -8.09750
MOL    -35.44794    188.11976    -57.86327    -21.50886    -23.88508
PEO    -13.07180     39.69280     -6.66376     -2.79028     -7.86194
SPR     -2.94210      9.74988     -2.94305     -1.24730      0.22444
STL     10.42275    -31.36539      1.23963       .36576

(The ST.L row has only four legible entries; they are listed in the
order in which they appear.)

First Two Eigenvectors

                    DUB      MOL      PEO      SPR     ST.L
$w_1'$            -2.42    -4.16    -0.84    -0.19     0.72
$w_2'$            -4.62    -0.81     1.26     0.06    -1.71
Column Markers:   $k_D$    $k_M$    $k_P$    $k_S$    $k_L$

(These eigenvectors are standardized so that
$w_1' S^{-1} w_1 = w_2' S^{-1} w_2 = 1$ and $w_1' S^{-1} w_2 = 0$.)

Critical Value $\theta$ = 0.420

This is the upper 5% point of the maximum characteristic root
distribution for 5 variables, 4 samples and 33 d.f. for error
(Heck, 1960).

$33\theta/(1-\theta)$ = 23.892

Radius of Comparison Circle, $\sqrt{33\theta / [(1-\theta)\, 2 n_i]}$

$i$           I       II      III      IV
Radius     .679    1.411    1.152    1.152
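
As a rough numerical check of Table 23, the sketch below recomputes the comparison circle radii from the critical value 0.420, the 33 error degrees of freedom and the sample sizes, and verifies the stated standardization of the two eigenvectors against the tabulated inverse matrix. The variable and function names are illustrative assumptions; the numbers are copied from the table, so agreement can only be expected up to the rounding of the printed values (about 0.002 for the radii, about 0.01 for the quadratic forms).

```python
import math

theta = 0.420                                # critical value (Table 23)
error_df = 33                                # error degrees of freedom
n = {"I": 26, "II": 6, "III": 9, "IV": 9}    # sample sizes (Table 23)

# Radius of each comparison circle: sqrt(33*theta / ((1 - theta) * 2 * n_i)).
scale = error_df * theta / (1.0 - theta)
radii = {period: math.sqrt(scale / (2 * size)) for period, size in n.items()}

# Inverse of the error variance-covariance matrix, as printed in Table 23.
S_inv = [
    [0.036037, -0.032422, -0.004803, -0.004409, 0.008005],
    [-0.032422, 0.092402, -0.029702, -0.007462, -0.008463],
    [-0.004803, -0.029702, 0.072456, -0.032675, 0.003077],
    [-0.004409, -0.007462, -0.032675, 0.089663, -0.028855],
    [0.008005, -0.008463, 0.003077, -0.028855, 0.028573],
]
w1 = [-2.42, -4.16, -0.84, -0.19, 0.72]
w2 = [-4.62, -0.81, 1.26, 0.06, -1.71]


def quad_form(a, matrix, b):
    """Return a' * matrix * b for plain Python lists."""
    return sum(a[r] * matrix[r][c] * b[c]
               for r in range(len(a)) for c in range(len(b)))


q11 = quad_form(w1, S_inv, w1)   # should be close to 1
q22 = quad_form(w2, S_inv, w2)   # should be close to 1
q12 = quad_form(w1, S_inv, w2)   # should be close to 0

for period, radius in radii.items():
    print(period, round(radius, 3))
print(round(q11, 3), round(q22, 3), round(q12, 3))
```

The radii shrink as the sample size grows, which is why the circle for Period I (26 years) is much smaller than that for Period II (6 years).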

The comparison circle significance tests on Figure 20 show
Period IV's means to differ significantly from the other three
periods' means. Periods I and III differ only barely significantly,
and Period II does not quite differ significantly from either of these.
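
The decision rule for comparison circles amounts to a simple geometric test: two circles are disjoint exactly when the distance between their centres exceeds the sum of their radii. The sketch below illustrates this with hypothetical centres and radii for three batches labelled A, B and C; these are not the coordinates of Figure 20, which cannot be read from the text.

```python
import math
from itertools import combinations


def circles_intersect(center_a, radius_a, center_b, radius_b):
    """True if two circles in the biplot plane overlap or touch."""
    distance = math.hypot(center_a[0] - center_b[0],
                          center_a[1] - center_b[1])
    return distance <= radius_a + radius_b


# Hypothetical biplot positions and radii (illustration only).
centers = {"A": (0.0, 0.0), "B": (1.5, 0.5), "C": (5.0, 0.0)}
radii = {"A": 0.7, "B": 1.4, "C": 1.2}

significant = {}
for a, b in combinations(sorted(centers), 2):
    # Disjoint circles: the two batch means differ significantly.
    significant[(a, b)] = not circles_intersect(
        centers[a], radii[a], centers[b], radii[b])
    verdict = "differ significantly" if significant[(a, b)] else "do not differ"
    print(a, "and", b, verdict)
```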

The scatter of the four period means (j-points) can be
related to the configuration of the five station measurements
(k-arrows). It is evident from Figure 20 that Period IV had
large means in Northeastern Illinois, especially at
Moline and less so at Dubuque. This confirms the "effect of seeding"
in that area in Period IV, though the difference between Moline and
Dubuque is unexpected. The small, and non-significant, difference
between Period II and the "unseeded" Periods I and III is mostly
in the direction of $k_{STL}$, indicating higher precipitation at St.
Louis in that period, which is as it should be since that was
where "seeding" was carried out. The other small, though significant,
difference is between the two unseeded periods, I and III; it is not
quite clear what this is due to and it may well be a "Type I error",
i.e., a falsely significant finding when no true difference exists.

It is evident that much the same general picture emerges
from the comparison of means on the MANOVA JK'-biplot of Figure 20
and from the comparison of scatters (ellipses of concentration) in
the GH'-biplot of Figure 18. Indeed, both these biplots are
projections of the data matrix, with the four batches of points and
five columns, onto different two-dimensional planes. The GH'-biplot
describes the entire variability of the data, whereas the JK'-biplot
shows only the scatter of means. The latter therefore emphasizes
the differences between the periods rather than what they have in
common. To the extent that the latter configuration is different from
the former, it is because of this different emphasis.

It must be remarked that the approximate significance tests
illustrated in Figure 20 are valid only to the extent that the required
assumptions are satisfied. These are (1) multi-normal distribution of
precipitation at the five stations, (2) equal variances-covariances,
and (3) independence of observations. For annual precipitation data
(1) may be a reasonable approximation and (2) would probably hold
fairly well unless seeding effects were large. Whether successive
annual precipitation amounts are independent is more doubtful,
though a recent study (Gabriel and Petrondas, in preparation) suggests
that assumption (3) is not seriously wrong. If it were not tenable,
then it would be wrong to regard the four periods as random samples
and the significance tests would be quite invalid. That is a crucial
point in many meteorological applications; it is often doubtful
whether successive observations can be considered independent and
thus the application of significance tests is suspect. The emphasis
in this chapter was therefore on exploratory data analysis rather
than on significance testing of hypotheses; the former seems to be of
more use in meteorological research.
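
One simple way to look at assumption (3) is the lag-1 autocorrelation of an annual series. The sketch below is an illustration only: it is not the method of the study cited above, the precipitation values are hypothetical, and the bound of 2 divided by the square root of n is a common rule of thumb for an uncorrelated series rather than an exact test.

```python
import math


def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation of a sequence of numbers."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    numerator = sum(dev[t] * dev[t + 1] for t in range(n - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


def rough_bound(n):
    """Rule-of-thumb limit for |r1| under independence (about 2/sqrt(n))."""
    return 2.0 / math.sqrt(n)


# Hypothetical annual precipitation totals at one station.
annual = [33.1, 36.4, 30.2, 35.0, 38.7, 31.9, 34.3, 29.8, 37.2, 33.6]

r1 = lag1_autocorrelation(annual)
print("lag-1 autocorrelation:", round(r1, 3))
print("rough bound:", round(rough_bound(len(annual)), 3))
```

A value of r1 well outside the rough bound would cast doubt on treating successive years as independent observations.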
A test of significance provides a decision on whether
to regard an observed phenomenon as "random" or real. In
other words, could the phenomenon have arisen in a manner
analogous to the outcome of a game of chance, or does it
reflect a real pattern? The issue of randomness versus
real effects is often of great importance: Are there
real periodicities in precipitation, trends in temperature,
etc., and could the claimed effect of cloud seeding programs
be merely due to chance? The use of significance tests to
resolve these questions is, however, not as straightforward
as might be thought. A few words on this topic are in
order.

Significance tests are designed to disentangle real
from random effects. They do so by checking whether the
observations seem "non-random" in the direction in which
real effects are a priori thought likely to occur. Thus,
when a cloud seeding experiment is designed, the hypothesis
of no effect is to be tested against that of augmented
precipitation subsequent to seeding. When this expected
effect is precisely defined in terms of location of
precipitation, time, method of measurement, etc., a significance
test can properly be applied.
Significance tests are of more doubtful validity when
they are applied to "effects" which were first observed
during the experiment itself. For example, the Swiss
Grossversuch III was designed to reduce hail but observation
of increased rainfall led to significance testing of
augmented precipitation. It is a common occurrence that
apparent effects are first observed in a particular area
or at a particular time, e.g., after some change in seeding
protocol, and then these particular "effects" are
tested for significance. The validity of such testing is
often in doubt because it does not take into account the
fact that the one most striking phenomenon observed in
the data was singled out for testing. A multiplicity of
other phenomena were not tested because they did not happen
to occur in such a striking form in those particular data.
Significance tests are not usually designed to accommodate
such selection of effects for testing. When it is done,
the multiplicity of possible choices dilutes the significance
and leads to spuriously "significant" results.
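
The dilution of significance by selection can be illustrated numerically. The sketch below assumes, purely for illustration, that m candidate "effects" are independent standard normal statistics with no real effect present, and that only the most striking one is tested at the nominal two-sided 5% level. Both the independence assumption and the function names are introduced here and are not part of the argument above.

```python
import random


def familywise_rate(m, alpha=0.05):
    """Chance that at least one of m independent null tests is 'significant'."""
    return 1.0 - (1.0 - alpha) ** m


def simulate_selected_test(m, trials=20000, z_crit=1.96, seed=0):
    """Share of null data sets in which the most striking effect passes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        most_striking = max(abs(rng.gauss(0.0, 1.0)) for _ in range(m))
        if most_striking > z_crit:
            hits += 1
    return hits / trials


m = 20
analytic = familywise_rate(m)
simulated = simulate_selected_test(m)
print("analytic rate:", round(analytic, 3))
print("simulated rate:", round(simulated, 3))
```

With twenty candidate effects the chance of a spuriously "significant" finding is well over one half, far above the nominal 5%.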

When non-experimental data are tested for significance,
one should have even greater concern for the validity of
inferences. Why was a particular phenomenon chosen for
testing? Surely because it was observed to be remarkable.
If so, the results of significance tests are strongly
biased in favor of deciding on non-randomness. Tests
would be valid only if carried out on new data sets,
independent of those which suggested the phenomenon.
A convenient terminology is that distinguishing
confirmatory from exploratory analyses (Tukey, 1977).
The latter are essentially inductive, sifting through
data for leads, patterns, suggestions and ideas. The
former are of a more deductive and rigidly defined character:
they follow a protocol laid out in advance for
the confirmation or refutation of a particular issue,
as in the prior hypothesis on precipitation to be confirmed
or rejected by a cloud seeding experiment (Gabriel, 1980).

No doubt there is much more exploration than confirmation
in scientific work, especially in non-laboratory situations,
and these are common in meteorology. Application
of significance tests in exploratory analyses cannot be
regarded as a rigorous, well defined procedure: at best
it serves to give vague indications of the relative roles
of randomness and real effects.


7.2.  The  exploratory  nature  of  multivariate  analysis 

Multivariate analysis, by definition, deals with a
multiplicity of measures, none of which has been identified
as the unique or principal bearer of the information sought.
If a problem were closely defined and circumscribed, a single
variable or function of variables would have been likely to
emerge as the measure most relevant to the problem at hand.
The analysis then would have lost its multivariate character.
The simultaneous study of several variables thus implies that
the subject is not narrowly focused and a definite hypothesis
about the phenomena under study has not yet emerged. Hence
multivariate analyses are unlikely to be confirmatory.
Conversely, a confirmatory study is most likely to be
univariate; the topic to be tested has been formulated precisely
and allows confirmation. Exploratory studies are often
multivariate, and allow the investigator to search for effects among
a multiplicity of variables.
7.3. Significance tests in multivariate analysis

We  have  argued  that  multivariate  analysis  is  mostly 
exploratory,  and  that  exploratory  studies  do  not  in  general 
depend  much  on  significance  tests.  Hence  the  role  of 
significance  testing  in  multivariate  analysis  is  likely  to 
be  minimal.  This  chapter  has  therefore  not  stressed 
topics  of  significance  testing.  Readers  who  still  wish 
to  apply  tests  of  significance  to  multivariate  data  are 
referred to Morrison's (1976) excellent elementary text and
to Essenwanger's (1976) more advanced volume. They will
find tests for the types of comparisons discussed in Section 6
as well as for other types of multivariate analyses of data
from Gaussian distributions. For a description of methods
which are more robust against non-normality, readers are
referred to Gnanadesikan (1977). The present author hopes that
the  convenience  of  a  single  summary  or  significance  level  will 
not  deter  his  readers  from  exploring  their  data.  He  also  hopes 
that  the  present  chapter  may  help  his  readers  to  look  at  their 
data  and  discover  what  they  have  to  tell. 